Noninvasive diagnostics by sequencing 5-hydroxymethylated cell-free DNA

ABSTRACT

Provided herein is a method of sequencing hydroxymethylated cell-free DNA. In some embodiments, the method comprises adding an affinity tag to only hydroxymethylated DNA molecules in a sample of cfDNA; enriching for the DNA molecules that are tagged with the affinity tag; and sequencing the enriched DNA molecules.

CROSS-REFERENCING

This application claims the benefit of U.S. provisional application Ser. Nos. 62/319,702, filed Apr. 7, 2016, 62/444,122, filed Jan. 9, 2017, and 62/461,712, filed Feb. 21, 2017, which applications are incorporated by reference in their entirety.

BACKGROUND

DNA modifications in the form of 5-methylcytosine (5mC) and the recently identified 5-hydroxymethylcytosine (5hmC) represent the two major epigenetic marks found in the mammalian genome, and they impact a broad range of biological processes from gene regulation to normal development. Detecting aberrant 5mC and 5hmC changes in cell-free DNA (cfDNA) may represent an attractive noninvasive approach for cancer diagnostics. cfDNA is the circulating DNA found in blood, which originates from different tissues and has been utilized for noninvasive prenatal tests, organ transplant diagnostics, and cancer detection. Compared to the intensive research on cell-free 5mC DNA as a biomarker for cancer diagnostics, cell-free 5hmC DNA has remained unexploited, mostly due to the low level of 5hmC compared to 5mC in the human genome (10- to 100-fold less than 5mC) and the lack of a sensitive low-input 5hmC DNA sequencing method to work with the minute amounts of cfDNA (typically only a few nanograms per ml of plasma).

SUMMARY

Provided herein, among other things, is a method of sequencing hydroxymethylated DNA in a sample of circulating cell-free DNA. In some embodiments, the method comprises adding an affinity tag to only hydroxymethylated DNA molecules in a sample of cfDNA; enriching for the DNA molecules that are tagged with the affinity tag; and sequencing the enriched DNA molecules.

In some embodiments, the method comprises: adding adaptor sequences onto the ends of the cfDNA; incubating the adaptor-ligated cfDNA with a DNA β-glucosyltransferase and UDP glucose modified with a chemoselective group, thereby covalently labeling the hydroxymethylated DNA molecules in the cfDNA with the chemoselective group; linking a biotin moiety to the chemoselectively-modified cfDNA via a cycloaddition reaction; enriching for biotinylated DNA molecules by binding to a support that binds to biotin; amplifying the enriched DNA using primers that bind to the adaptors; and sequencing the amplified DNA to produce a plurality of sequence reads.

Also provided is a method comprising: (a) obtaining a sample comprising circulating cell-free DNA, (b) enriching for the hydroxymethylated DNA in the sample, and (c) independently quantifying the amount of nucleic acids in the enriched hydroxymethylated DNA that map to each of one or more target loci.
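
By way of illustration only, the quantifying step (c) could be sketched in software as follows. The sketch assumes that the enriched reads have already been aligned and reduced to (chromosome, position) pairs and that each target locus is given as a half-open interval; the function name, the data layout and the FPKM-style normalization are assumptions made for this example and are not prescribed by the method.

```python
def quantify_loci(read_positions, loci):
    """Count reads per target locus, each locus independently.

    read_positions: iterable of (chrom, pos) pairs for mapped reads
                    (hypothetical layout chosen for this sketch).
    loci: dict mapping locus name -> (chrom, start, end), half-open [start, end).
    Returns a dict: locus name -> {"count": int, "fpkm": float}.
    """
    reads = list(read_positions)
    total = len(reads)
    results = {}
    for name, (chrom, start, end) in loci.items():
        # Raw number of reads whose position falls inside this locus.
        count = sum(1 for c, p in reads if c == chrom and start <= p < end)
        length = end - start
        # Reads per kilobase of locus per million mapped reads.
        if total and length > 0:
            fpkm = count * 1e9 / (length * total)
        else:
            fpkm = 0.0
        results[name] = {"count": count, "fpkm": fpkm}
    return results


example_loci = {"locus_X": ("chr1", 100, 1100)}
example_reads = [("chr1", 150), ("chr1", 1099), ("chr1", 1100), ("chr2", 150)]
example_result = quantify_loci(example_reads, example_loci)
```

Because every locus is counted separately against the same total, the value reported for one locus does not depend on which other loci are queried.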

Among other things, the sequences obtained from the method can be used as a diagnostic, theranostic or prognostic for a variety of diseases or conditions.

Also provided are a variety of compositions, including a composition comprising circulating cell-free DNA, wherein the hydroxymethylcytosine residues in the DNA are modified to contain a capture tag.

These and other features of the present teachings are set forth herein.

BRIEF DESCRIPTION OF THE FIGURES

The skilled artisan will understand that the drawings, described below, are for illustration purposes only. The drawings are not intended to limit the scope of the present teachings in any way.

FIGS. 1A-1C: Sequencing of 5hmC in cfDNA. FIG. 1A: General procedure of cell-free 5hmC sequencing. cfDNA is ligated with an Illumina adaptor and labeled with biotin on 5hmC for pull-down with streptavidin beads. The final library is completed by direct PCR from the streptavidin beads. FIG. 1B: Percentage of reads mapped to spike-in DNA in the sequencing libraries. Error bars indicate s.d. FIG. 1C: Metagene profiles of log2 fold change of cell-free 5hmC to input cfDNA ratio in genes ranked according to their expression in cell-free RNA-Seq.

FIGS. 2A-2D: Lung cancer leads to progressive loss of 5hmC enrichment in cfDNA. FIG. 2A: Genome browser view of the cell-free 5hmC distribution in a 10 Mb region in chromosome 6. Showing the overlap tracks of healthy, non-metastatic lung cancer, metastatic lung cancer, and input cfDNA samples in line plot. FIG. 2B: Heatmap of 1,159 metastatic lung cancer differential genes in healthy, lung cancer samples and the unenriched input cfDNA. Hierarchical clustering was performed across genes and samples. FIG. 2C: Boxplot of number of hMRs (normalized to 1 million reads) identified in each group. FIG. 2D: Boxplots of CCNY and PDIA6 5hmC FPKM in lung cancer and other cfDNA samples. *P<0.05, **P<0.01, ***P<0.001, ****P<1e-5, Welch t-test.

FIGS. 3A-3E: Cell-free 5hmC for monitoring HCC progression and treatment. FIG. 3A: tSNE plot of 5hmC FPKM from healthy, HBV and HCC samples. FIG. 3B: Heatmap of 1,006 HCC differential genes in healthy, HBV and HCC samples. Hierarchical clustering was performed across genes and samples. FIGS. 3C-3D: Boxplots of AHSG (FIG. 3C) and MTBP (FIG. 3D) 5hmC FPKM in HBV, HCC (pre-op), HCC post-op, HCC recurrence and other cfDNA samples. *P<0.05, **P<1e-4, ***P<1e-5, Welch t-test. FIG. 3E: tSNE plot of 5hmC FPKM from healthy, HCC pre-op, HCC post-op and HCC recurrence samples.

FIGS. 4A-4C: Cancer type and stage prediction with cell-free 5hmC. FIG. 4A: tSNE plot of 5hmC FPKM in cfDNA from healthy and various cancer samples. FIG. 4B: The actual and predicted classification by leave-one-out cross-validation using the Mclust (MC) and Random Forest (RF) algorithms, based on two feature sets (gene body and DhMR). FIG. 4C: The Cohen's kappa coefficient for measuring inter-classifier agreement (GB for gene body). The error bar indicates the standard error of the Cohen's kappa estimate.

FIGS. 5A-5F: Cell-free 5hmC sequencing by modified hMe-Seal. FIG. 5A: hMe-Seal reactions. 5hmC in DNA is labeled with an azide-modified glucose by βGT, which is then linked to a biotin group through click chemistry. FIG. 5B: Enrichment tests of a single pool of amplicons containing C, 5mC or 5hmC spiked into cfDNA. Gel analysis shows that, after hMe-Seal, only the 5hmC-containing amplicon can be PCR-amplified from the streptavidin beads. FIG. 5C: Boxplot of sequencing depth across all cell-free samples. FIG. 5D: Boxplot of unique nonduplicate map rate across all cell-free samples. FIG. 5E: MA-plot of normalized cell-free 5hmC read counts (reads/million) in 10 kb bins genome-wide between technical duplicates. The horizontal blue line M=0 indicates the same value in the two samples. A lowess fit (in red) is plotted to show a possible trend in the bias related to the mean value. FIG. 5F: Venn diagram of hMR overlap between technical replicates of cell-free 5hmC sequencing and a pooled sample from both replicates.

FIGS. 6A-6D: Genome-wide distribution of 5hmC in cfDNA. FIG. 6A: Genome browser view of the 5hmC distribution in a 10 Mb region in chromosome 20. Showing the tracks of enriched cfDNA and whole blood gDNA samples along with the unenriched input cfDNA. FIG. 6B: Pie chart presentation of the overall genomic distribution of hMRs in cfDNA. FIG. 6C: The relative enrichment of hMRs across distinct genomic regions in cfDNA and whole blood gDNA. FIG. 6D: tSNE plot of 5hmC FPKM in cfDNA and whole blood gDNA from healthy samples.

FIGS. 7A-7E: Differential 5hmC signals between cfDNA and whole blood gDNA. FIG. 7A: Heatmap of 2,082 differential genes between cfDNA and blood gDNA. Hierarchical clustering was performed across genes and samples. FIG. 7B: Boxplot of expression level in whole blood for cfDNA and whole blood gDNA 5hmC enriched genes. The p-value is shown on top. FIGS. 7C and 7D: GO analysis of the whole blood-specific (FIG. 7C) and cfDNA-specific (FIG. 7D) 5hmC enriched genes, adjusted p-value cut off 0.001. FIG. 7E: Genome browser view of the 5hmC distribution in the FPR1/FPR2 (top) and the GLP1R (bottom) loci. Showing the overlap tracks of cfDNA, whole blood gDNA and input cfDNA in line plot.

FIGS. 8A-8D: Cell-free hydroxymethylome in lung cancer. FIG. 8A: tSNE plot of 5hmC FPKM from healthy, non-metastatic lung cancer and metastatic lung cancer samples, along with the unenriched input cfDNA. FIG. 8B: Metagene profiles of cell-free 5hmC in healthy and various cancer groups, along with unenriched input cfDNA. Shaded area indicates s.e.m. FIG. 8C: Percentage of reads mapped to spike-in DNA in the sequencing libraries of various groups. Error bars indicate s.d. FIG. 8D: Genome browser view of the cell-free 5hmC distribution in the CREM/CCNY (left) and ATP6V1C2/PDIA6 (right) loci in healthy and lung cancer samples. Showing the overlap tracks in line plot.

FIGS. 9A-9E: Cell-free hydroxymethylome in HCC. FIG. 9A: Boxplot of expression level in liver tissue for HCC-specific 5hmC enriched and depleted genes. The p-value is shown on top. FIG. 9B: Genome browser view of the cell-free 5hmC distribution in the AHSG locus in healthy, HBV and HCC samples. Showing the overlap tracks in line plot. FIG. 9C: Expression of AHSG in liver and other tissues. FIG. 9D: Genome browser view of the cell-free 5hmC distribution in the MTBP locus in healthy, HBV and HCC samples. Showing the overlap tracks in line plot. FIG. 9E: Changes of HCC score in 4 HCC follow-up cases. Disease status is shown on the bottom. Time duration in months is shown on the top. Dotted lines indicate the median values of HCC scores in the HCC, HBV, and healthy groups. Triangles indicate treatment. HCC score is a linear combination of 1,006 HCC differential genes (FIG. 3B) that best separates HCC from HBV and healthy samples.

FIGS. 10A-10E: Cell-free hydroxymethylome in pancreatic cancer. FIG. 10A: Heatmap of 713 pancreatic cancer differential genes in healthy and pancreatic cancer samples. Hierarchical clustering was performed across genes and samples. FIGS. 10B and 10C: Boxplots of ZFP36L1, DCXR (FIG. 10B) and GPR21, SLC19A3 (FIG. 10C) 5hmC FPKM in pancreatic cancer and other cfDNA samples. *P<0.001, **P<1e-5, Welch t-test. FIGS. 10D and 10E: Genome browser view of the cell-free 5hmC distribution in the ZFP36L1, DCXR (FIG. 10D) and GPR21, SLC19A3 (FIG. 10E) loci in healthy and pancreatic cancer samples. Showing the overlap tracks in line plot.

FIGS. 11A-11D: Cell-free hydroxymethylome in cancer samples. FIG. 11A: tSNE plot of promoter 5hmC FPKM (5 kb upstream of TSS) from healthy and various cancer samples. FIG. 11B: tSNE plot of 5hmC FPKM from healthy and various cancer cfDNA samples along with the whole blood gDNA samples. FIG. 11C: Age distribution of healthy individuals and various cancer patients. FIG. 11D: tSNE plot of 5hmC FPKM in cfDNA from healthy and various cancer samples (FIG. 4A) colored by batches numbered according to the process time.

FIGS. 12A-12G: Cancer type and stage prediction with cell-free 5hmC. FIGS. 12A and 12B: Bayesian Information Criterion (BIC) plot by Mclust trained with the 90 gene body feature set (FIG. 12A) and the 17 DhMR feature set (FIG. 12B), indicating high BIC value for separating five groups when using the EEI model for Mclust. FIG. 12C: 4-dimensional Mclust-based dimensionality reduction plot using DhMR features. The lower half shows the scatter plot and the upper half shows the density plot. FIGS. 12D and 12E: Variable importance (mean decrease Gini) for the top 15 gene bodies (FIG. 12D) and DhMRs (FIG. 12E) in the random forest training model. FIGS. 12F and 12G show the variable importance for gene bodies and DhMRs, obtained using a different method.

FIG. 13: Examples of DhMRs in the random forest model. Genome browser view of the cell-free 5hmC distribution in four DhMRs with high variable importance in the random forest model in various groups. Showing the overlap tracks in line plot. Shaded area indicates the DhMR.

DEFINITIONS

Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are described.

All patents and publications, including all sequences disclosed within such patents and publications, referred to herein are expressly incorporated by reference.

Numeric ranges are inclusive of the numbers defining the range. Unless otherwise indicated, nucleic acids are written left to right in 5′ to 3′ orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively.

The headings provided herein are not limitations of the various aspects or embodiments of the invention. Accordingly, the terms defined immediately below are more fully defined by reference to the specification as a whole.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Singleton, et al., DICTIONARY OF MICROBIOLOGY AND MOLECULAR BIOLOGY, 2D ED., John Wiley and Sons, New York (1994), and Hale & Markham, THE HARPER COLLINS DICTIONARY OF BIOLOGY, Harper Perennial, N.Y. (1991) provide one of skill with the general meaning of many of the terms used herein. Still, certain terms are defined below for the sake of clarity and ease of reference.

The term “sample” as used herein relates to a material or mixture of materials, typically, although not necessarily, in liquid form, containing one or more analytes of interest.

The term “nucleic acid sample,” as used herein denotes a sample containing nucleic acids. Nucleic acid samples used herein may be complex in that they contain multiple different molecules that contain sequences. Genomic DNA from a mammal (e.g., mouse or human) is a type of complex sample. Complex samples may have more than 10⁴, 10⁵, 10⁶ or 10⁷ different nucleic acid molecules. A DNA target may originate from any source such as genomic DNA, or an artificial DNA construct. Any sample containing nucleic acid, e.g., genomic DNA made from tissue culture cells or a sample of tissue, may be employed herein. A nucleic acid sample can be made from any suitable source, including a sample of tooth, bone or hair, etc.

The term “nucleotide” is intended to include those moieties that contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the term “nucleotide” includes those moieties that contain hapten or fluorescent labels and may contain not only conventional ribose and deoxyribose sugars, but other sugars as well. Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, or are functionalized as ethers, amines, or the like.

The terms “nucleic acid” and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 2 bases, greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, up to about 10,000 or more bases composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, and may be produced enzymatically or synthetically (e.g., PNA as described in U.S. Pat. No. 5,948,902 and the references cited therein) which can hybridize with naturally occurring nucleic acids in a sequence specific manner analogous to that of two naturally occurring nucleic acids, e.g., can participate in Watson-Crick base pairing interactions. Naturally-occurring nucleotides include guanine, cytosine, adenine and thymine (G, C, A and T, respectively). DNA and RNA have a deoxyribose and ribose sugar backbone, respectively, whereas PNA's backbone is composed of repeating N-(2-aminoethyl)-glycine units linked by peptide bonds. In PNA various purine and pyrimidine bases are linked to the backbone by methylene carbonyl bonds. A locked nucleic acid (LNA), often referred to as inaccessible RNA, is a modified RNA nucleotide. The ribose moiety of an LNA nucleotide is modified with an extra bridge connecting the 2′ oxygen and 4′ carbon. The bridge “locks” the ribose in the 3′-endo (North) conformation, which is often found in the A-form duplexes. LNA nucleotides can be mixed with DNA or RNA residues in the oligonucleotide whenever desired. The term “unstructured nucleic acid,” or “UNA,” is a nucleic acid containing non-natural nucleotides that bind to each other with reduced stability. For example, an unstructured nucleic acid may contain a G′ residue and a C′ residue, where these residues correspond to non-naturally occurring forms, i.e., analogs, of G and C that base pair with each other with reduced stability, but retain an ability to base pair with naturally occurring C and G residues, respectively. Unstructured nucleic acid is described in US20050233340, which is incorporated by reference herein for disclosure of UNA. Also included in this definition are ZNAs, i.e., zip nucleic acids.

The term “oligonucleotide” as used herein denotes a single-stranded multimer of nucleotides of from about 2 to 200 nucleotides, up to 500 nucleotides in length. Oligonucleotides may be synthetic or may be made enzymatically, and, in some embodiments, are 30 to 150 nucleotides in length. Oligonucleotides may contain ribonucleotide monomers (i.e., may be oligoribonucleotides) and/or deoxyribonucleotide monomers. An oligonucleotide may be 10 to 20, 21 to 30, 31 to 40, 41 to 50, 51 to 60, 61 to 70, 71 to 80, 80 to 100, 100 to 150 or 150 to 200 nucleotides in length, for example.

The term “hybridization” refers to the process by which a strand of nucleic acid joins with a complementary strand through base pairing as known in the art. A nucleic acid is considered to be “selectively hybridizable” to a reference nucleic acid sequence if the two sequences specifically hybridize to one another under moderate to high stringency hybridization and wash conditions. Moderate and high stringency hybridization conditions are known (see, e.g., Ausubel, et al., Short Protocols in Molecular Biology, 3rd ed., Wiley & Sons 1995 and Sambrook et al., Molecular Cloning: A Laboratory Manual, Third Edition, 2001 Cold Spring Harbor, N.Y.). One example of high stringency conditions includes hybridization at about 42° C. in 50% formamide, 5×SSC, 5×Denhardt's solution, 0.5% SDS and 100 μg/ml denatured carrier DNA followed by washing two times in 2×SSC and 0.5% SDS at room temperature and two additional times in 0.1×SSC and 0.5% SDS at 42° C.

“Primer” means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on. Typical primers can be in the range of 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.

The term “duplex,” or “duplexed,” as used herein, describes two complementary polynucleotides that are base-paired, i.e., hybridized together.

The terms “determining,” “measuring,” “evaluating,” “assessing,” “assaying,” and “analyzing” are used interchangeably herein to refer to any form of measurement, and include determining if an element is present or not. These terms include quantitative and/or qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, as well as determining whether it is present or absent.

The term “using” has its conventional meaning, and, as such, means employing, e.g., putting into service, a method or composition to attain an end. For example, if a program is used to create a file, a program is executed to make a file, the file usually being the output of the program. In another example, if a computer file is used, it is usually accessed, read, and the information stored in the file employed to attain an end. Similarly if a unique identifier, e.g., a barcode is used, the unique identifier is usually read to identify, for example, an object or file associated with the unique identifier.

The term “ligating,” as used herein, refers to the enzymatically catalyzed joining of the terminal nucleotide at the 5′ end of a first DNA molecule to the terminal nucleotide at the 3′ end of a second DNA molecule.

A “plurality” contains at least 2 members. In certain cases, a plurality may have at least 10, at least 100, at least 1,000, at least 10,000, at least 100,000, at least 10⁶, at least 10⁷, at least 10⁸ or at least 10⁹ or more members.

If two nucleic acids are “complementary,” each base of one of the nucleic acids base pairs with corresponding nucleotides in the other nucleic acid. Two nucleic acids do not need to be perfectly complementary in order to hybridize to one another.
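
As a non-limiting illustration of this definition, the following sketch checks two DNA strands, each written 5′ to 3′, for perfect complementarity and counts mismatches when the pairing is imperfect. The function names are hypothetical, and the sketch assumes unmodified A, C, G and T bases and strands of equal length.

```python
# Watson-Crick pairing partners for unmodified DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_perfectly_complementary(top, bottom):
    """True if every base of `top` pairs with the corresponding base of `bottom`."""
    return reverse_complement(top) == bottom.upper()


def mismatch_count(top, bottom):
    """Number of positions at which two equal-length strands fail to base pair."""
    if len(top) != len(bottom):
        raise ValueError("strands must have the same length")
    return sum(a != b for a, b in zip(reverse_complement(top), bottom.upper()))
```

A pair with a small, non-zero mismatch count is not perfectly complementary under this check, yet may still hybridize, as noted above.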

The term “separating,” as used herein, refers to physical separation of two elements (e.g., by size or affinity, etc.) as well as degradation of one element, leaving the other intact.

The term “sequencing,” as used herein, refers to a method by which the identity of at least 10 consecutive nucleotides (e.g., the identity of at least 20, at least 50, at least 100 or at least 200 or more consecutive nucleotides) of a polynucleotide is obtained.

The terms “next-generation sequencing” or “high-throughput sequencing”, as used herein, refer to the so-called parallelized sequencing-by-synthesis or sequencing-by-ligation platforms currently employed by Illumina, Life Technologies, and Roche, etc. Next-generation sequencing methods may also include nanopore sequencing methods such as that commercialized by Oxford Nanopore Technologies, electronic-detection based methods such as Ion Torrent technology commercialized by Life Technologies, or single-molecule fluorescence-based methods such as that commercialized by Pacific Biosciences.

The term “adaptor” refers to a nucleic acid that is ligatable to both strands of a double-stranded DNA molecule. In one embodiment, an adaptor may be a hairpin adaptor (i.e., one molecule that base pairs with itself to form a structure that has a double-stranded stem and a loop, where the 3′ and 5′ ends of the molecule ligate to the 5′ and 3′ ends of the double-stranded DNA molecule, respectively). In another embodiment, an adaptor may be a Y-adaptor. In another embodiment, an adaptor may itself be composed of two distinct oligonucleotide molecules that are base paired with one another. As would be apparent, a ligatable end of an adaptor may be designed to be compatible with overhangs made by cleavage by a restriction enzyme, or it may have blunt ends or a 5′ T overhang. The term “adaptor” refers to double-stranded as well as single-stranded molecules. An adaptor can be DNA or RNA, or a mixture of the two. An adaptor containing RNA may be cleavable by RNase treatment or by alkaline hydrolysis. An adaptor may be 15 to 100 bases, e.g., 50 to 70 bases, although adaptors outside of this range are envisioned.

The term “adaptor-ligated,” as used herein, refers to a nucleic acid that has been ligated to an adaptor. The adaptor can be ligated to a 5′ end and/or a 3′ end of a nucleic acid molecule.

The term “asymmetric adaptor”, as used herein, refers to an adaptor that, when ligated to both ends of a double stranded nucleic acid fragment, will lead to a top strand that contains a 5′ tag sequence that is not the same as or complementary to the tag sequence at the 3′ end. Exemplary asymmetric adaptors are described in: U.S. Pat. Nos. 5,712,126 and 6,372,434 and WO/2009/032167; all of which are incorporated by reference herein in their entirety. An asymmetrically tagged fragment can be amplified by two primers: one that hybridizes to a first tag sequence added to the 3′ end of a strand, and another that hybridizes to the complement of a second tag sequence added to the 5′ end of a strand. Y-adaptors and hairpin adaptors (which can be cleaved, after ligation, to produce a “Y-adaptor”) are examples of asymmetric adaptors.

The term “Y-adaptor” refers to an adaptor that contains: a double-stranded region and a single-stranded region in which the opposing sequences are not complementary. The end of the double-stranded region can be joined to target molecules such as double-stranded fragments of genomic DNA, e.g., by ligation or a transposase-catalyzed reaction. Each strand of an adaptor-tagged double-stranded DNA that has been ligated to a Y-adaptor is asymmetrically tagged in that it has the sequence of one strand of the Y-adaptor at one end and the other strand of the Y-adaptor at the other end. Amplification of nucleic acid molecules that have been joined to Y-adaptors at both ends results in an asymmetrically tagged nucleic acid, i.e., a nucleic acid that has a 5′ end containing one tag sequence and a 3′ end that has another tag sequence.

The term “hairpin adaptor” refers to an adaptor that is in the form of a hairpin. In one embodiment, after ligation the hairpin loop can be cleaved to produce strands that have non-complementary tags on the ends. In some cases, the loop of a hairpin adaptor may contain a uracil residue, and the loop can be cleaved using uracil DNA glycosylase and endonuclease VIII, although other methods are known.

The term “adaptor-ligated sample”, as used herein, refers to a sample that has been ligated to an adaptor. As would be understood given the definitions above, a sample that has been ligated to an asymmetric adaptor contains strands that have non-complementary sequences at the 5′ and 3′ ends.

An “oligonucleotide binding site” refers to a site to which an oligonucleotide hybridizes in a target polynucleotide. If an oligonucleotide “provides” a binding site for a primer, then the primer may hybridize to that oligonucleotide or its complement.

The term “strand” as used herein refers to a nucleic acid made up of nucleotides linked together by covalent bonds, e.g., phosphodiester bonds. In a cell, DNA usually exists in a double-stranded form, and as such, has two complementary strands of nucleic acid referred to herein as the “top” and “bottom” strands. In certain cases, complementary strands of a chromosomal region may be referred to as “plus” and “minus” strands, the “first” and “second” strands, the “coding” and “noncoding” strands, the “Watson” and “Crick” strands or the “sense” and “antisense” strands. The assignment of a strand as being a top or bottom strand is arbitrary and does not imply any particular orientation, function or structure. The nucleotide sequences of the first strand of several exemplary mammalian chromosomal regions (e.g., BACs, assemblies, chromosomes, etc.) are known, and may be found in NCBI's Genbank database, for example.

The term “tagging” as used herein, refers to the appending of a sequence tag (that contains an identifier sequence) onto a nucleic acid molecule. A sequence tag may be added to the 5′ end, the 3′ end, or both ends of a nucleic acid molecule. A sequence tag can be added to a fragment by ligating an adaptor to the fragment by, e.g., T4 DNA ligase or another ligase.

The term “molecular barcode” encompasses both sample identifier sequences and molecule identifier sequences, as described below. In some embodiments, a molecular barcode may have a length in range of from 1 to 36 nucleotides, e.g., from 6 to 30 nucleotides, or 8 to 20 nucleotides. In certain cases, the molecular identifier sequence may be error-correcting, meaning that even if there is an error (e.g., if the sequence of the molecular barcode is mis-synthesized, mis-read or is distorted by virtue of the various processing steps leading up to the determination of the molecular barcode sequence) then the code can still be interpreted correctly. Descriptions of exemplary error correcting sequences can be found throughout the literature (e.g., US20100323348 and US20090105959, which are both incorporated herein by reference). In some embodiments, an identifier sequence may be of relatively low complexity (e.g., may be composed of a mixture of 4 to 1024 different sequences), although higher complexity identifier sequences can be used in some cases.
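
One simple way in which a barcode can be interpreted correctly despite an error is nearest-match decoding by Hamming distance, sketched below for illustration only. This is not necessarily the scheme of the cited references. The sketch assumes substitution errors only and a known barcode set whose members differ from one another at three or more positions, so that a single error still points to exactly one barcode; the barcode sequences and function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, known_barcodes, max_mismatches=1):
    """Return the unique known barcode within `max_mismatches` of `observed`.

    Returns None when no barcode, or more than one barcode, is close enough,
    so ambiguous reads are rejected rather than guessed.
    """
    hits = [
        bc for bc in known_barcodes
        if len(bc) == len(observed) and hamming(observed, bc) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode set; every pair differs at all six positions.
KNOWN = ["ACGTAC", "TGCATG", "GATCGA"]
```

With this set, an observed read of ACGTAA (one substitution away from ACGTAC) decodes to ACGTAC, whereas a read that is far from every known barcode is left unassigned.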

The terms “sample identifier sequence” and “sample index” refer to a sequence of nucleotides that is appended to a target polynucleotide, where the sequence identifies the source of the target polynucleotide (i.e., the sample from which the target polynucleotide is derived). In use, each sample is tagged with a different sample identifier sequence (e.g., one sequence is appended to each sample, where the different samples are appended to different sequences), and the tagged samples are pooled. After the pooled sample is sequenced, the sample identifier sequence can be used to identify the source of the sequences. A sample identifier sequence may be added to the 5′ end of a polynucleotide or the 3′ end of a polynucleotide. In certain cases some of the sample identifier sequence may be at the 5′ end of a polynucleotide and the remainder of the sample identifier sequence may be at the 3′ end of the polynucleotide. When the sample identifier has sequence elements at each end, the 3′ and 5′ sample identifier sequences together identify the sample. In many examples, the sample identifier sequence is only a subset of the bases which are appended to a target oligonucleotide.
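
For illustration only, assigning pooled reads back to their source samples could be sketched as follows. The sketch assumes each read has already been split into its sample index and its insert sequence, and that indexes are matched exactly; the index sequences, sample names and function name are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group insert sequences by the sample that their index identifies.

    reads: iterable of (index_sequence, insert_sequence) pairs.
    index_to_sample: dict mapping index sequence -> sample name.
    Reads whose index is not in the table are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for index_seq, insert_seq in reads:
        sample = index_to_sample.get(index_seq, "undetermined")
        by_sample[sample].append(insert_seq)
    return dict(by_sample)


# Hypothetical index table and pooled reads.
INDEXES = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}
POOLED = [("ACGTAC", "GGGA"), ("TGCATG", "CCCT"), ("ACGTAC", "TTTA"), ("NNNNNN", "AAAA")]
DEMUXED = demultiplex(POOLED, INDEXES)
```

Where the sample identifier is split between the 5′ and 3′ ends, the same lookup could be keyed on the pair of end sequences instead of a single index.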

The term “molecule identifier sequence” refers to a sequence of nucleotides that can be appended to the nucleic acid fragments of a sample such that the appended sequence of nucleotides, alone or in combination with other features of the fragments, e.g., their fragmentation breakpoints, can be used to distinguish between the different fragment molecules in the sample or a portion thereof. The complexity of a population of molecule identifier sequences used in any one implementation may vary depending on a variety of parameters, e.g., the number of fragments in a sample and/or the amount of the sample that is used in a subsequent step. For example, in certain cases, the molecule identifier sequence may be of low complexity (e.g., may be composed of a mixture of 8 to 1024 sequences). In other cases, the molecule identifier sequence may be of high complexity (e.g., may be composed of 1025 to 1M or more sequences). In certain embodiments, a population of molecule identifier sequences may comprise a degenerate base region (DBR) comprising one or more (e.g., at least 2, at least 3, at least 4, at least 5, or 5 to 30 or more) nucleotides selected from R, Y, S, W, K, M, B, D, H, V, N (as defined by the IUPAC code), or a variant thereof. As described in U.S. Pat. No. 8,741,606, a molecule identifier sequence may be made up of sequences that are non-adjacent. In some embodiments, a population of molecule identifier sequences may be made by mixing oligonucleotides of a defined sequence together. In these embodiments, the molecule identifier sequence in each of the oligonucleotides may be error correcting. In the methods described herein, the molecule identifier sequence may be used to distinguish between the different fragments in a portion of an initial sample, where the portion has been removed from the initial sample. The molecule identifier sequences may be used in conjunction with other features of the fragments (e.g., the end sequences of the fragments, which define the breakpoints) to distinguish between the fragments.
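The relationship between a degenerate base region and the complexity of the resulting population of molecule identifier sequences can be illustrated with a short calculation. The following is a minimal sketch only; the function name `dbr_complexity` and the example DBR strings are hypothetical and are not taken from the cited patent. It multiplies the number of bases that each IUPAC code can represent.

```python
# Number of concrete bases represented by each IUPAC nucleotide code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def dbr_complexity(dbr: str) -> int:
    """Return how many distinct sequences a degenerate base region can encode."""
    total = 1
    for code in dbr.upper():
        total *= IUPAC_DEGENERACY[code]  # KeyError signals a non-IUPAC character
    return total


# Illustrative (hypothetical) DBR designs.
low = dbr_complexity("NNNNN")        # five fully degenerate positions
high = dbr_complexity("NNNNNNNNNN")  # ten fully degenerate positions
```

Under these assumptions, a DBR of five N positions falls at the upper end of the low complexity range given above (1024 sequences), whereas ten N positions give a little over one million sequences.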

As used herein, the term “correspond to”, with reference to a sequence read that corresponds to a particular (e.g., the top or bottom) strand of a fragment, refers to a sequence read derived from that strand or an amplification product thereof.

The term “covalently linking” refers to the production of a covalent linkage between two separate molecules.

As used herein, the term “circulating cell-free DNA” refers to DNA thatis circulating in the peripheral blood of a patient. The DNA moleculesin cell-free DNA may have a median size that is below 1 kb (e.g., in therange of 50 bp to 500 bp, 80 bp to 400 bp, or 100-1,000 bp), althoughfragments having a median size outside of this range may be present.Cell-free DNA may contain circulating tumor DNA (ctDNA), i.e., tumor DNAcirculating freely in the blood of a cancer patient or circulating fetalDNA (if the subject is a pregnant female). cfDNA can be highlyfragmented and in some cases can have a mean fragment size about 165-250bp (Newman et al Nat Med. 2014 20: 548-54). cfDNA can be obtained bycentrifuging whole blood to remove all cells, and then isolating the DNAfrom the remaining plasma or serum. Such methods are well known (see,e.g., Lo et al, Am J Hum Genet 1998; 62:768-75). Circulating cell-freeDNA is double-stranded, but can be made single stranded by denaturation.

As used herein, the term “adding adaptor sequences” refers to the act of adding an adaptor sequence to the end of fragments in a sample. This may be done by filling in the ends of the fragments using a polymerase, adding an A tail, and then ligating an adaptor comprising a T overhang onto the A-tailed fragments.

As used herein, the term “UDP glucose modified with a chemoselectivegroup” refers to a UDP glucose that has been functionalized,particularly at the 6-hydroxyl position, to include a group that iscapable of participating in a 1,3 cycloaddition (or “click”) reaction.Such groups include azido and alkynyl (e.g., cyclooctyne) groups,although others are known (Kolb et al., 2001; Speers and Cravatt, 2004;Sletten and Bertozzi, 2009). UDP-6-N₃-Glu is an example of a UDP glucosemodified with a chemoselective group, although others are known.

As used herein, the term “biotin moiety” refers to an affinity tag that includes biotin or a biotin analogue such as desthiobiotin, oxybiotin, 2-iminobiotin, diaminobiotin, biotin sulfoxide, biocytin, etc. Biotin moieties bind to streptavidin with an affinity of at least 10⁻⁸ M.

As used herein, the terms “cycloaddition reaction” and “click reaction” are used interchangeably to refer to a 1,3-cycloaddition between an azide and alkyne to form a five membered heterocycle. In some embodiments, the alkyne may be strained (e.g., in a ring such as cyclooctyne) and the cycloaddition reaction may be done in copper-free conditions. Dibenzocyclooctyne (DBCO) and difluorooctyne (DIFO) are examples of alkynes that can participate in a copper-free cycloaddition reaction, although other groups are known. See, e.g., Kolb et al (Drug Discov Today 2003 8: 1128-113), Baskin et al (Proc. Natl. Acad. Sci. 2007 104: 16793-16797) and Sletten et al (Accounts of Chemical Research 2011 44: 666-676) for a review of this chemistry.

As used herein, the term “support that binds to biotin” refers to a support (e.g., beads, which may be magnetic) that is linked to streptavidin or avidin, or a functional equivalent thereof.

The term “amplifying” as used herein refers to generating one or more copies of a target nucleic acid, using the target nucleic acid as a template.

The term “copies of fragments” refers to the product of amplification, where a copy of a fragment can be a reverse complement of a strand of a fragment, or have the same sequence as a strand of a fragment.

The terms “enrich” and “enrichment” refer to a partial purification of analytes that have a certain feature (e.g., nucleic acids that contain hydroxymethylcytosine) from analytes that do not have the feature (e.g., nucleic acids that do not contain hydroxymethylcytosine). Enrichment typically increases the concentration of the analytes that have the feature (e.g., nucleic acids that contain hydroxymethylcytosine) by at least 2-fold, at least 5-fold or at least 10-fold relative to the analytes that do not have the feature. After enrichment, at least 10%, at least 20%, at least 50%, at least 80% or at least 90% of the analytes in a sample may have the feature used for enrichment. For example, at least 10%, at least 20%, at least 50%, at least 80% or at least 90% of the nucleic acid molecules in an enriched composition may contain a strand having one or more hydroxymethylcytosines that have been modified to contain a capture tag.

Other definitions of terms may appear throughout the specification.

DESCRIPTION OF EXEMPLARY EMBODIMENTS

Provided herein is a method of sequencing hydroxymethylated cell-free DNA. In some embodiments, the method comprises adding an affinity tag to only hydroxymethylated DNA molecules in a sample of cfDNA, enriching for the DNA molecules that are tagged with the affinity tag, and sequencing the enriched DNA molecules.

FIG. 1A shows one implementation of the method. In certain embodiments and with reference to FIG. 1A, the method may comprise: (a) adding adaptor sequences onto the ends of cell-free DNA (cfDNA); (b) incubating the adaptor-ligated cfDNA with a DNA β-glucosyltransferase and UDP glucose modified with a chemoselective group, thereby covalently labeling the hydroxymethylated DNA molecules in the cfDNA with the chemoselective group; (c) linking a biotin moiety to the chemoselectively-modified cfDNA via a cycloaddition reaction; (d) enriching for the biotinylated DNA molecules by binding the product of the biotin labeling step (step c) to a support that binds to biotin; (e) amplifying the enriched DNA using primers that bind to the adaptors; and (f) sequencing the amplified DNA to produce a plurality of sequence reads.

As shown in FIG. 1A, in some embodiments, the method does not comprise releasing the biotinylated DNA molecules from the support prior to amplification (i.e., after step (d), prior to step (e)) and, as such, in some embodiments the amplifying step (e) may comprise amplifying the enriched DNA while it is bound to the support of (d). This may be implemented by: i. washing the support of (d) after the biotinylated DNA molecules have bound to the support; and then ii. setting up an amplification reaction containing the support, without releasing the biotinylated DNA molecules from the support.

Also as shown in FIG. 1A, step (a) may be implemented by ligating the DNA to a universal adaptor, i.e., an adaptor that ligates to both ends of the fragments of cfDNA. In certain cases, the universal adaptor may be added by ligating a Y adaptor (or hairpin adaptor) onto the ends of the cfDNA, thereby producing a double stranded DNA molecule that has a top strand that contains a 5′ tag sequence that is not the same as or complementary to the tag sequence added to the 3′ end of the strand. As should be apparent, the DNA fragments used in the initial step of the method should be non-amplified DNA that has not been denatured beforehand. As shown in FIG. 1A, this step may require polishing (i.e., blunting) the ends of the cfDNA with a polymerase, A-tailing the fragments using, e.g., Taq polymerase, and ligating a T-tailed Y adaptor to the A-tailed fragments. This initial ligation step may be done on a limiting amount of cfDNA. For example, cfDNA to which the adaptors are ligated may contain less than 200 ng of DNA, e.g., 10 pg to 200 ng, 100 pg to 200 ng, 1 ng to 200 ng or 5 ng to 50 ng, or less than 10,000 (e.g., less than 5,000, less than 1,000, less than 500, less than 100 or less than 10) haploid genome equivalents, depending on the genome. In some embodiments, the method is done using less than 50 ng of cfDNA (which corresponds to approximately 5 ml of plasma) or less than 10 ng of cfDNA (which corresponds to approximately 1 ml of plasma). For example, Newman et al (Nat Med. 2014 20: 548-54) made libraries from 7-32 ng cfDNA isolated from 1-5 mL plasma. This is equivalent to 2,121-9,697 haploid genomes (assuming 3.3 pg per haploid genome). The adaptor ligated onto the cfDNA may contain a molecular barcode to facilitate multiplexing and quantitative analysis of the sequenced molecules. Specifically, the adaptor may be “indexed” in that it contains a molecular barcode that identifies the sample to which it was ligated (which allows samples to be pooled before sequencing). Alternatively or in addition, the adaptor may contain a random barcode or the like. Such an adaptor can be ligated to the fragments such that substantially every fragment corresponding to a particular region is tagged with a different sequence. This allows for identification of PCR duplicates and allows molecules to be counted.
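The conversion from cfDNA mass to haploid genome equivalents used above can be reproduced with a short calculation. This is a sketch only; the function name is hypothetical, and the 3.3 pg figure is the per-genome mass assumed in the text.

```python
# Mass of one haploid human genome in picograms, as assumed in the text.
PG_PER_HAPLOID_GENOME = 3.3


def haploid_genome_equivalents(cfdna_ng: float,
                               pg_per_genome: float = PG_PER_HAPLOID_GENOME) -> int:
    """Convert a cfDNA mass in nanograms to haploid genome equivalents."""
    picograms = cfdna_ng * 1000.0  # 1 ng = 1000 pg
    return round(picograms / pg_per_genome)


# Input range reported for the libraries of Newman et al (7-32 ng).
lower = haploid_genome_equivalents(7)
upper = haploid_genome_equivalents(32)
```

With these inputs the calculation gives 2,121 and 9,697 haploid genomes, matching the range stated above.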

In the next step of this implementation of the method, the hydroxymethylated DNA molecules in the cfDNA are labeled with the chemoselective group, i.e., a group that can participate in a click reaction. This step may be done by incubating the adaptor-ligated cfDNA with DNA β-glucosyltransferase (e.g., T4 DNA β-glucosyltransferase (which is commercially available from a number of vendors), although other DNA β-glucosyltransferases exist) and, e.g., UDP-6-N₃-Glu (i.e., UDP glucose containing an azide). This step may be done using a protocol adapted from US20110301045 or Song et al (Nat. Biotechnol. 2011 29: 68-72), for example.

The next step of this implementation of the method involves adding a biotin moiety to the chemoselectively modified DNA via a cycloaddition (click) reaction. This step may be done by directly adding a biotinylated reactant, e.g., a dibenzocyclooctyne-modified biotin, to the glucosyltransferase reaction after that reaction has been completed, i.e., after an appropriate amount of time (e.g., after 30 minutes or more). In some embodiments, the biotinylated reactant may be of the general formula B-L-X, where B is a biotin moiety, L is a linker and X is a group that reacts with the chemoselective group added to the cfDNA via a cycloaddition reaction. In certain cases, the linker may make the compound more soluble in an aqueous environment and, as such, may contain a polyethyleneglycol (PEG) linker or an equivalent thereof. In some embodiments, the added compound may be dibenzocyclooctyne-PEG_(n)-biotin, where n is 2-10, e.g., 4. Dibenzocyclooctyne-PEG4-biotin is relatively hydrophilic and is soluble in aqueous buffer up to a concentration of 0.35 mM. The compound added in this step does not need to contain a cleavable linkage, e.g., does not contain a disulfide linkage or the like. In this step, the cycloaddition reaction may be between an azido group added to the hydroxymethylated cfDNA and an alkynyl group (e.g., dibenzocyclooctyne group) that is linked to the biotin moiety. Again, this step may be done using a protocol adapted from US20110301045 or Song et al (Nat. Biotechnol. 2011 29: 68-72), for example.

The enrichment step of the method may be done using magnetic streptavidin beads, although other supports could be used. As noted above, the enriched cfDNA molecules (which correspond to the hydroxymethylated cfDNA molecules) are amplified by PCR and then sequenced.

In these embodiments, the enriched DNA sample may be amplified using one or more primers that hybridize to the added adaptors (or their complements). In embodiments in which Y-adaptors are added, the adaptor-ligated nucleic acids may be amplified by PCR using two primers: a first primer that hybridizes to the single-stranded region of the top strand of the adaptor, and a second primer that hybridizes to the complement of the single-stranded region of the bottom strand of the Y adaptor (or hairpin adaptor, after cleavage of the loop). For example, in some embodiments the Y adaptor used may have P5 and P7 arms (which sequences are compatible with Illumina's sequencing platform) and the amplification products will have the P5 sequence at one end and the P7 sequence at the other. These amplification products can be hybridized to an Illumina sequencing substrate and sequenced. In another embodiment, the pair of primers used for amplification may have 3′ ends that hybridize to the Y adaptor and 5′ tails that either have the P5 sequence or the P7 sequence. In these embodiments, the amplification products will also have the P5 sequence at one end and the P7 sequence at the other, and can likewise be hybridized to an Illumina sequencing substrate and sequenced. This amplification step may be done by limited cycle PCR (e.g., 5-20 cycles).

The sequencing step may be done using any convenient next generation sequencing method and may result in at least 10,000, at least 50,000, at least 100,000, at least 500,000, at least 1M, at least 10M, at least 100M or at least 1B sequence reads. In some cases, the reads are paired-end reads. As would be apparent, the primers used for amplification may be compatible with use in any next generation sequencing platform in which primer extension is used, e.g., Illumina's reversible terminator method, Roche's pyrosequencing method (454), Life Technologies' sequencing by ligation (the SOLiD platform), Life Technologies' Ion Torrent platform or Pacific Biosciences' fluorescent base-cleavage method. Examples of such methods are described in the following references: Margulies et al (Nature 2005 437: 376-80); Ronaghi et al (Analytical Biochemistry 1996 242: 84-9); Shendure (Science 2005 309: 1728); Imelfort et al (Brief Bioinform. 2009 10:609-18); Fox et al (Methods Mol Biol. 2009; 553:79-108); Appleby et al (Methods Mol Biol. 2009; 513:19-39); English (PLoS One. 2012 7: e47768); and Morozova (Genomics. 2008 92:255-64), which are incorporated by reference for the general descriptions of the methods and the particular steps of the methods, including all starting products, reagents, and final products for each of the steps.

In certain embodiments, the sample sequenced may comprise a pool of DNAmolecules from a plurality of samples, wherein the nucleic acids in thesample have a molecular barcode to indicate their source. In someembodiments the nucleic acids being analyzed may be derived from asingle source (e.g., a single organism, virus, tissue, cell, subject,etc.), whereas in other embodiments, the nucleic acid sample may be apool of nucleic acids extracted from a plurality of sources (e.g., apool of nucleic acids from a plurality of organisms, tissues, cells,subjects, etc.), where by “plurality” is meant two or more. As such, incertain embodiments, a nucleic acid sample can contain nucleic acidsfrom 2 or more sources, 3 or more sources, 5 or more sources, 10 or moresources, 50 or more sources, 100 or more sources, 500 or more sources,1000 or more sources, 5000 or more sources, up to and including about10,000 or more sources. Molecular barcodes may allow the sequences fromdifferent sources to be distinguished after they are analyzed.
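Assigning pooled reads back to their source using a sample-identifying barcode can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular sequencing pipeline: each read is modeled as a pair of an index sequence and an insert sequence, exact matching is used, and the function name, barcodes and sample names are all hypothetical.

```python
def demultiplex(reads, sample_index):
    """Group insert sequences by sample using exact barcode matching.

    reads: iterable of (index_sequence, insert_sequence) pairs.
    sample_index: dict mapping each barcode to a sample name.
    Returns (reads grouped by sample name, inserts with an unknown barcode).
    """
    by_sample = {name: [] for name in sample_index.values()}
    unassigned = []
    for index_seq, insert_seq in reads:
        name = sample_index.get(index_seq)
        if name is None:
            unassigned.append(insert_seq)  # barcode not in the sample sheet
        else:
            by_sample[name].append(insert_seq)
    return by_sample, unassigned


# Hypothetical pooled run with two indexed samples.
index = {"ACGT": "patient_1", "TTGA": "patient_2"}
pooled = [("ACGT", "AAA"), ("TTGA", "CCC"), ("ACGT", "GGG"), ("NNNN", "TTT")]
by_sample, unassigned = demultiplex(pooled, index)
```

A practical implementation would typically tolerate one or more mismatches in the barcode; exact matching is used here only to keep the sketch short.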

The sequence reads may be analyzed by a computer and, as such, instructions for performing the steps set forth below may be set forth as programming that may be recorded in a suitable physical computer readable storage medium.

In some embodiments, the sequence reads may be analyzed to provide aquantitative determination of which sequences are hydroxymethylated inthe cfDNA. This may be done by, e.g., counting sequence reads or,alternatively, counting the number of original starting molecules, priorto amplification, based on their fragmentation breakpoint and/or whetherthey contain the same indexer sequence. The use of molecular barcodes inconjunction with other features of the fragments (e.g., the endsequences of the fragments, which define the breakpoints) to distinguishbetween the fragments is known. Molecular barcodes and exemplary methodsfor counting individual molecules are described in Casbon (Nucl. AcidsRes. 2011, 22 e81) and Fu et al (Proc Natl Acad Sci USA. 2011 108:9026-31), among others. Molecular barcodes are described in US2015/0044687, US 2015/0024950, US 2014/0227705, U.S. Pat. Nos. 8,835,358and 7,537,897, as well as a variety of other publications.
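Counting original starting molecules rather than raw reads can be sketched as below. The sketch assumes that each aligned read has already been reduced to a tuple of chromosome, fragment start, fragment end and molecular barcode; that representation, the function name and the example data are hypothetical. Reads sharing the same breakpoints and the same barcode are treated as PCR duplicates of one molecule.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Count distinct starting molecules per chromosome.

    reads: iterable of (chrom, start, end, barcode) tuples.
    Two reads are collapsed into one molecule when their fragmentation
    breakpoints (start, end) and their molecular barcode are identical.
    """
    molecules = defaultdict(set)
    for chrom, start, end, barcode in reads:
        molecules[chrom].add((start, end, barcode))
    return {chrom: len(keys) for chrom, keys in molecules.items()}


# Hypothetical aligned reads.
example_reads = [
    ("chr1", 100, 265, "ACGT"),
    ("chr1", 100, 265, "ACGT"),  # PCR duplicate of the first read
    ("chr1", 100, 265, "TTAG"),  # same breakpoints, different barcode
    ("chr1", 300, 470, "ACGT"),  # same barcode, different breakpoints
    ("chr2", 50, 215, "GGCA"),
]
molecule_counts = count_unique_molecules(example_reads)
```

In this example five reads collapse to three molecules on chr1 and one on chr2. Sequencing errors in the barcode are ignored here; error-tolerant grouping would be a further refinement.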

In certain embodiments, two different cfDNA samples may be compared using the above methods. The different samples may be composed of an “experimental” sample, i.e., a cfDNA sample of interest, and a “control” cfDNA sample to which the experimental cfDNA sample may be compared. In many embodiments, the different samples are obtained from subjects, one subject being a subject of interest, e.g., a patient with a disease, and the other a control subject, e.g., a patient who does not have the disease. Exemplary sample pairs include, for example, cfDNA from a subject having a disease (such as colon, breast, prostate, lung, or skin cancer, or infection with a pathogen, etc.) and cfDNA from normal subjects that do not have the disease, and cfDNA from two different time points from the same subject, e.g., before and after administration of a therapy, etc.

Also provided is a method for identifying a hydroxymethylation pattern that correlates with a phenotype, e.g., a disease, condition or clinical outcome, etc. In some embodiments, this method may comprise (a) performing the above-described method on a plurality of cfDNA samples, wherein the cfDNA samples are isolated from patients having a known phenotype, e.g., disease, condition or clinical outcome, thereby determining which sequences are hydroxymethylated in cfDNA from each of the patients; and (b) identifying a hydroxymethylation signature that is correlated with the phenotype.

In some embodiments, the hydroxymethylation signature may be diagnostic (e.g., may provide a diagnosis of a disease or condition or the type or stage of a disease or condition, etc.), prognostic (e.g., indicating a clinical outcome, e.g., survival or death within a time frame) or theranostic (e.g., indicating which treatment would be the most effective).

Also provided is a method for analyzing a patient sample. In this embodiment, the method may comprise: (a) identifying, using the above-described method, sequences that are hydroxymethylated in the cfDNA of a patient; (b) comparing the identified sequences to a set of signature sequences that are correlated with a phenotype, e.g., a disease, condition, or clinical outcome, etc.; and (c) providing a report indicating a correlation with the phenotype. This embodiment may further comprise making a diagnosis, prognosis or theranosis based on the results of the comparison.

In some embodiments, the method may involve creating a report as described above (an electronic form of which may have been forwarded from a remote location) and forwarding the report to a doctor or other medical professional to determine whether a patient has a phenotype (e.g., cancer, etc.) or to identify a suitable therapy for the patient. The report may be used as a diagnostic to determine whether the subject has a disease or condition, e.g., a cancer. In certain embodiments, the method may be used to determine the stage or type of cancer, to identify metastasized cells, or to monitor a patient's response to a treatment, for example.

In any embodiment, a report can be forwarded to a “remote location”, where “remote location” means a location other than the location at which the sample is analyzed. For example, a remote location could be another location (e.g., office, lab, etc.) in the same city, another location in a different city, another location in a different state, another location in a different country, etc. As such, when one item is indicated as being “remote” from another, what is meant is that the two items can be in the same room but separated, or at least in different rooms or different buildings, and can be at least one mile, ten miles, or at least one hundred miles apart. “Communicating” information refers to transmitting the data representing that information as electrical signals over a suitable communication channel (e.g., a private or public network). “Forwarding” an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data. Examples of communicating media include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the internet, including email transmissions and information recorded on websites and the like. In certain embodiments, the report may be analyzed by an MD or other qualified medical professional, and a report based on the results of the analysis may be forwarded to the patient from which the sample was obtained.

Also provided is a method for analyzing a sample comprising (a) determining, using the method described above, which sequences are hydroxymethylated in a first sample of cfDNA and which sequences are hydroxymethylated in a second sample of cfDNA, wherein the first and second samples of cfDNA are obtained from the same patient at two different time points; and (b) comparing the hydroxymethylation pattern for the first sample to the hydroxymethylation pattern for the second sample to determine if there has been a change in hydroxymethylation over time. This method may be quantitative and, in some embodiments, the comparing step (b) may comprise comparing the level of hydroxymethylation of one or more selected sequences. The comparison step of this method may map the changes in hydroxymethylation in the course of a disease, condition, or a treatment of a disease or condition.
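One way the quantitative comparison of two time points could be sketched is shown below. The approach is an assumption made for illustration, not a method stated in this disclosure: per-region counts are scaled to counts per million to correct for sequencing depth, and the change is expressed as a log2 ratio with a pseudocount. The function names, region names and counts are hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale raw per-region counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no counts")
    return {region: 1e6 * n / total for region, n in counts.items()}


def log2_fold_change(first, second, pseudocount=1.0):
    """Return log2(second / first) per region after depth normalization.

    Positive values indicate more 5hmC signal at the second time point.
    The pseudocount avoids division by zero for regions absent in one sample.
    """
    a = counts_per_million(first)
    b = counts_per_million(second)
    return {
        region: math.log2((b.get(region, 0.0) + pseudocount)
                          / (a.get(region, 0.0) + pseudocount))
        for region in set(a) | set(b)
    }


# Hypothetical molecule counts before and after a treatment.
before = {"region_A": 50, "region_B": 50}
after = {"region_A": 75, "region_B": 25}
change = log2_fold_change(before, after)
```

Here `change["region_A"]` is positive and `change["region_B"]` is negative, reflecting a relative gain and loss of signal, respectively.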

The phenotype of a patient can be any observable characteristic or traitof a subject, such as a disease or condition, a disease stage orcondition stage, susceptibility to a disease or condition, prognosis ofa disease stage or condition, a physiological state, or response totherapeutics, etc. A phenotype can result from a subject's geneexpression as well as the influence of environmental factors and theinteractions between the two, as well as from epigenetic modificationsto nucleic acid sequences.

The phenotype in a subject can be characterized by analyzing cfDNA usingthe method described above. For example, characterizing a phenotype fora subject or individual may include detecting a disease or condition(including pre-symptomatic early stage detecting), determining theprognosis, diagnosis, or theranosis of a disease or condition, ordetermining the stage or progression of a disease or condition.Characterizing a phenotype can also include identifying appropriatetreatments or treatment efficacy for specific diseases, conditions,disease stages and condition stages, predictions and likelihood analysisof disease progression, particularly disease recurrence, metastaticspread or disease relapse. A phenotype can also be a clinically distincttype or subtype of a condition or disease, such as a cancer or tumor.Phenotype determination can also be a determination of a physiologicalcondition, or an assessment of organ distress or organ rejection, suchas post-transplantation. The products and processes described hereinallow assessment of a subject on an individual basis, which can providebenefits of more efficient and economical decisions in treatment.

In some embodiments, the method may be used to identify a signature thatpredicts whether a subject is likely to respond to a treatment for adisease or disorder.

Characterizing a phenotype may include predicting theresponder/non-responder status of the subject, wherein a responderresponds to a treatment for a disease and a non-responder does notrespond to the treatment. If a hydroxymethylation signature in a subjectmore closely aligns with that of previous subjects that were known torespond to the treatment, the subject can be characterized, orpredicted, as a responder to the treatment. Similarly, if thehydroxymethylation signature in the subject more closely aligns withthat of previous subjects that did not respond to the treatment, thesubject can be characterized, or predicted as a non-responder to thetreatment. The treatment can be for any appropriate disease, disorder orother condition. The method can be used in any disease setting where ahydroxymethylation signature that correlates withresponder/non-responder status is known.
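The idea of assigning a subject to whichever group of previous subjects its signature "more closely aligns" with can be sketched as a nearest-centroid comparison. This is only one possible reading, chosen for brevity; the disclosure does not specify a distance measure or classifier. Signatures are modeled as equal-length lists of normalized hydroxymethylation levels over the same regions, and all names and values are hypothetical.

```python
import math


def centroid(profiles):
    """Average a list of equal-length signature vectors position by position."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def predict_status(subject, responders, non_responders):
    """Label a subject by the closer group centroid (Euclidean distance)."""
    d_resp = math.dist(subject, centroid(responders))
    d_non = math.dist(subject, centroid(non_responders))
    return "responder" if d_resp < d_non else "non-responder"


# Hypothetical two-region signatures from previously treated subjects.
known_responders = [[1.0, 0.2], [0.8, 0.0]]
known_non_responders = [[0.1, 0.9], [0.3, 1.1]]

status_1 = predict_status([0.9, 0.1], known_responders, known_non_responders)
status_2 = predict_status([0.2, 1.0], known_responders, known_non_responders)
```

In this toy example the first subject is labeled a responder and the second a non-responder. Ties are assigned to the non-responder group, which is an arbitrary choice of the sketch.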

In some embodiments, the phenotype comprises a disease or condition suchas those listed below. For example, the phenotype can comprise thepresence of or likelihood of developing a tumor, neoplasm, or cancer. Acancer detected or assessed by products or processes described hereinincludes, but is not limited to, breast cancer, ovarian cancer, lungcancer, colon cancer, hyperplastic polyp, adenoma, colorectal cancer,high grade dysplasia, low grade dysplasia, prostatic hyperplasia,prostate cancer, melanoma, pancreatic cancer, brain cancer (such as aglioblastoma), hematological malignancy, hepatocellular carcinoma,cervical cancer, endometrial cancer, head and neck cancer, esophagealcancer, gastrointestinal stromal tumor (GIST), renal cell carcinoma(RCC) or gastric cancer. The colorectal cancer can be CRC Dukes B orDukes C-D. The hematological malignancy can be B-Cell ChronicLymphocytic Leukemia, B-Cell Lymphoma-DLBCL, B-CellLymphoma-DLBCL-germinal center-like, B-Cell Lymphoma-DLBCL-activatedB-cell-like, and Burkitt's lymphoma.

In some embodiments, the phenotype may be a premalignant condition, suchas actinic keratosis, atrophic gastritis, leukoplakia, erythroplasia,lymphomatoid granulomatosis, preleukemia, fibrosis, cervical dysplasia,uterine cervical dysplasia, xeroderma pigmentosum, Barrett's Esophagus,colorectal polyp, or other abnormal tissue growth or lesion that islikely to develop into a malignant tumor. Transformative viralinfections such as HIV and HPV also present phenotypes that can beassessed according to the method.

The cancer characterized by the present method may be, withoutlimitation, a carcinoma, a sarcoma, a lymphoma or leukemia, a germ celltumor, a blastoma, or other cancers. Carcinomas include withoutlimitation epithelial neoplasms, squamous cell neoplasms squamous cellcarcinoma, basal cell neoplasms basal cell carcinoma, transitional cellpapillomas and carcinomas, adenomas and adenocarcinomas (glands),adenoma, adenocarcinoma, linitis plastica insulinoma, glucagonoma,gastrinoma, vipoma, cholangiocarcinoma, hepatocellular carcinoma,adenoid cystic carcinoma, carcinoid tumor of appendix, prolactinoma,oncocytoma, hurthle cell adenoma, renal cell carcinoma, grawitz tumor,multiple endocrine adenomas, endometrioid adenoma, adnexal and skinappendage neoplasms, mucoepidermoid neoplasms, cystic, mucinous andserous neoplasms, cystadenoma, pseudomyxoma peritonei, ductal, lobularand medullary neoplasms, acinar cell neoplasms, complex epithelialneoplasms, warthin's tumor, thymoma, specialized gonadal neoplasms, sexcord stromal tumor, thecoma, granulosa cell tumor, arrhenoblastoma,sertoli leydig cell tumor, glomus tumors, paraganglioma,pheochromocytoma, glomus tumor, nevi and melanomas, melanocytic nevus,malignant melanoma, melanoma, nodular melanoma, dysplastic nevus,lentigo maligna melanoma, superficial spreading melanoma, and malignantacral lentiginous melanoma. Sarcoma includes without limitation Askin'stumor, botryodies, chondrosarcoma, Ewing's sarcoma, malignant hemangioendothelioma, malignant schwannoma, osteosarcoma, soft tissue sarcomasincluding: alveolar soft part sarcoma, angiosarcoma, cystosarcomaphyllodes, dermatofibrosarcoma, desmoid tumor, desmoplastic small roundcell tumor, epithelioid sarcoma, extraskeletal chondrosarcoma,extraskeletal osteosarcoma, fibrosarcoma, hemangiopericytoma,hemangiosarcoma, kaposi's sarcoma, leiomyosarcoma, liposarcoma,lymphangiosarcoma, lymphosarcoma, malignant fibrous histiocytoma,neurofibrosarcoma, rhabdomyosarcoma, and synovialsarcoma. 
Lymphoma andleukemia include without limitation chronic lymphocytic leukemia/smalllymphocytic lymphoma, B-cell prolymphocytic leukemia, lymphoplasmacyticlymphoma (such as waldenstrom macroglobulinemia), splenic marginal zonelymphoma, plasma cell myeloma, plasmacytoma, monoclonal immunoglobulindeposition diseases, heavy chain diseases, extranodal marginal zone Bcell lymphoma, also called malt lymphoma, nodal marginal zone B celllymphoma (nmzl), follicular lymphoma, mantle cell lymphoma, diffuselarge B cell lymphoma, mediastinal (thymic) large B cell lymphoma,intravascular large B cell lymphoma, primary effusion lymphoma, burkittlymphoma/leukemia, T cell prolymphocytic leukemia, T cell large granularlymphocytic leukemia, aggressive NK cell leukemia, adult T cellleukemia/lymphoma, extranodal NK/T cell lymphoma, nasal type,enteropathy-type T cell lymphoma, hepatosplenic T cell lymphoma, blasticNK cell lymphoma, mycosis fungoides/sezary syndrome, primary cutaneousCD30-positive T cell lymphoproliferative disorders, primary cutaneousanaplastic large cell lymphoma, lymphomatoid papulosis,angioimmunoblastic T cell lymphoma, peripheral T cell lymphoma,unspecified, anaplastic large cell lymphoma, classical hodgkin lymphomas(nodular sclerosis, mixed cellularity, lymphocyte-rich, lymphocytedepleted or not depleted), and nodular lymphocyte-predominant hodgkinlymphoma. Germ cell tumors include without limitation germinoma,dysgerminoma, seminoma, nongerminomatous germ cell tumor, embryonalcarcinoma, endodermal sinus turmor, choriocarcinoma, teratoma,polyembryoma, and gonadoblastoma. Blastoma includes without limitationnephroblastoma, medulloblastoma, and retinoblastoma. 
Other cancersinclude without limitation labial carcinoma, larynx carcinoma,hypopharynx carcinoma, tongue carcinoma, salivary gland carcinoma,gastric carcinoma, adenocarcinoma, thyroid cancer (medullary andpapillary thyroid carcinoma), renal carcinoma, kidney parenchymacarcinoma, cervix carcinoma, uterine corpus carcinoma, endometriumcarcinoma, chorion carcinoma, testis carcinoma, urinary carcinoma,melanoma, brain tumors such as glioblastoma, astrocytoma, meningioma,medulloblastoma and peripheral neuroectodermal tumors, gall bladdercarcinoma, bronchial carcinoma, multiple myeloma, basalioma, teratoma,retinoblastoma, choroidea melanoma, seminoma, rhabdomyosarcoma,craniopharyngeoma, osteosarcoma, chondrosarcoma, myosarcoma,liposarcoma, fibrosarcoma, Ewing sarcoma, and plasmocytoma.

In a further embodiment, the cancer under analysis may be a lung cancerincluding non-small cell lung cancer and small cell lung cancer(including small cell carcinoma (oat cell cancer), mixed smallcell/large cell carcinoma, and combined small cell carcinoma), coloncancer, breast cancer, prostate cancer, liver cancer, pancreas cancer,brain cancer, kidney cancer, ovarian cancer, stomach cancer, skincancer, bone cancer, gastric cancer, breast cancer, pancreatic cancer,glioma, glioblastoma, hepatocellular carcinoma, papillary renalcarcinoma, head and neck squamous cell carcinoma, leukemia, lymphoma,myeloma, or a solid tumor.

In further embodiments, the cancer may be an acute lymphoblastic leukemia; acute myeloid leukemia; adrenocortical carcinoma; AIDS-related cancers; AIDS-related lymphoma; anal cancer; appendix cancer; astrocytomas; atypical teratoid/rhabdoid tumor; basal cell carcinoma; bladder cancer; brain stem glioma; brain tumor (including brain stem glioma, central nervous system atypical teratoid/rhabdoid tumor, central nervous system embryonal tumors, astrocytomas, craniopharyngioma, ependymoblastoma, ependymoma, medulloblastoma, medulloepithelioma, pineal parenchymal tumors of intermediate differentiation, supratentorial primitive neuroectodermal tumors and pineoblastoma); breast cancer; bronchial tumors; Burkitt lymphoma; cancer of unknown primary site; carcinoid tumor; carcinoma of unknown primary site; central nervous system atypical teratoid/rhabdoid tumor; central nervous system embryonal tumors; cervical cancer; childhood cancers; chordoma; chronic lymphocytic leukemia; chronic myelogenous leukemia; chronic myeloproliferative disorders; colon cancer; colorectal cancer; craniopharyngioma; cutaneous T-cell lymphoma; endocrine pancreas islet cell tumors; endometrial cancer; ependymoblastoma; ependymoma; esophageal cancer; esthesioneuroblastoma; Ewing sarcoma; extracranial germ cell tumor; extragonadal germ cell tumor; extrahepatic bile duct cancer; gallbladder cancer; gastric (stomach) cancer; gastrointestinal carcinoid tumor; gastrointestinal stromal cell tumor; gastrointestinal stromal tumor (GIST); gestational trophoblastic tumor; glioma; hairy cell leukemia; head and neck cancer; heart cancer; Hodgkin lymphoma; hypopharyngeal cancer; intraocular melanoma; islet cell tumors; Kaposi sarcoma; kidney cancer; Langerhans cell histiocytosis; laryngeal cancer; lip cancer; liver cancer; malignant fibrous histiocytoma bone cancer; medulloblastoma; medulloepithelioma; melanoma; Merkel cell carcinoma; Merkel cell skin carcinoma; mesothelioma; metastatic squamous neck cancer with occult primary; mouth cancer; multiple endocrine neoplasia syndromes; multiple myeloma; multiple myeloma/plasma cell neoplasm; mycosis fungoides; myelodysplastic syndromes; myeloproliferative neoplasms; nasal cavity cancer; nasopharyngeal cancer; neuroblastoma; Non-Hodgkin lymphoma; nonmelanoma skin cancer; non-small cell lung cancer; oral cancer; oral cavity cancer; oropharyngeal cancer; osteosarcoma; other brain and spinal cord tumors; ovarian cancer; ovarian epithelial cancer; ovarian germ cell tumor; ovarian low malignant potential tumor; pancreatic cancer; papillomatosis; paranasal sinus cancer; parathyroid cancer; pelvic cancer; penile cancer; pharyngeal cancer; pineal parenchymal tumors of intermediate differentiation; pineoblastoma; pituitary tumor; plasma cell neoplasm/multiple myeloma; pleuropulmonary blastoma; primary central nervous system (CNS) lymphoma; primary hepatocellular liver cancer; prostate cancer; rectal cancer; renal cancer; renal cell (kidney) cancer; renal cell cancer; respiratory tract cancer; retinoblastoma; rhabdomyosarcoma; salivary gland cancer; Sezary syndrome; small cell lung cancer; small intestine cancer; soft tissue sarcoma; squamous cell carcinoma; squamous neck cancer; stomach (gastric) cancer; supratentorial primitive neuroectodermal tumors; T-cell lymphoma; testicular cancer; throat cancer; thymic carcinoma; thymoma; thyroid cancer; transitional cell cancer; transitional cell cancer of the renal pelvis and ureter; trophoblastic tumor; ureter cancer; urethral cancer; uterine cancer; uterine sarcoma; vaginal cancer; vulvar cancer; Waldenstrom macroglobulinemia; or Wilms tumor. The methods of the invention can be used to characterize these and other cancers. Thus, characterizing a phenotype can be providing a diagnosis, prognosis or theranosis of one of the cancers disclosed herein.

The phenotype can also be an inflammatory disease, immune disease, or autoimmune disease. For example, the disease may be inflammatory bowel disease (IBD), Crohn's disease (CD), ulcerative colitis (UC), pelvic inflammation, vasculitis, psoriasis, diabetes, autoimmune hepatitis, Multiple Sclerosis, Myasthenia Gravis, Type I diabetes, Rheumatoid Arthritis, Psoriasis, Systemic Lupus Erythematosus (SLE), Hashimoto's Thyroiditis, Graves' disease, Ankylosing Spondylitis, Sjogren's Disease, CREST syndrome, Scleroderma, Rheumatic Disease, organ rejection, Primary Sclerosing Cholangitis, or sepsis.

The phenotype can also comprise a cardiovascular disease, such as atherosclerosis, congestive heart failure, vulnerable plaque, stroke, or ischemia. The cardiovascular disease or condition can be high blood pressure, stenosis, vessel occlusion or a thrombotic event.

The phenotype can also comprise a neurological disease, such as Multiple Sclerosis (MS), Parkinson's Disease (PD), Alzheimer's Disease (AD), schizophrenia, bipolar disorder, depression, autism, Prion Disease, Pick's disease, dementia, Huntington disease (HD), Down's syndrome, cerebrovascular disease, Rasmussen's encephalitis, viral meningitis, neuropsychiatric systemic lupus erythematosus (NPSLE), amyotrophic lateral sclerosis, Creutzfeldt-Jakob disease, Gerstmann-Straussler-Scheinker disease, transmissible spongiform encephalopathy, ischemic reperfusion damage (e.g. stroke), brain trauma, microbial infection, or chronic fatigue syndrome. The phenotype may also be a condition such as fibromyalgia, chronic neuropathic pain, or peripheral neuropathic pain.

The phenotype may also comprise an infectious disease, such as a bacterial, viral or yeast infection. For example, the disease or condition may be Whipple's Disease, Prion Disease, cirrhosis, methicillin-resistant Staphylococcus aureus, HIV, hepatitis, syphilis, meningitis, malaria, tuberculosis, or influenza. Viral proteins, such as HIV or HCV-like particles can be assessed in a vesicle, to characterize a viral condition.

The phenotype can also comprise a perinatal or pregnancy related condition (e.g. preeclampsia or preterm birth), or a metabolic disease or condition, such as a metabolic disease or condition associated with iron metabolism. For example, hepcidin can be assayed in a vesicle to characterize an iron deficiency. The metabolic disease or condition can also be diabetes, inflammation, or a perinatal condition.

A correlative “signature” may be a group of 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 or more sequences that are independently either under-hydroxymethylated or over-hydroxymethylated relative to a control (e.g., “normal” cfDNA), where, collectively, the identity of the sequences and, optionally, the amount of hydroxymethylation associated with those sequences, correlates with a phenotype.

The cfDNA used in the method may be from a mammal such as bovine, avian, canine, equine, feline, ovine, porcine, or primate animals (including humans and non-human primates). In some embodiments, the subject can have a pre-existing disease or condition, such as cancer. Alternatively, the subject may not have any known pre-existing condition. The subject may also be non-responsive to an existing or past treatment, such as a treatment for cancer. In some embodiments, the cfDNA may be from a pregnant female. In some embodiments, the hydroxymethylation pattern in the fetal fraction of the cfDNA may correlate with a chromosomal abnormality in the fetus (e.g., an aneuploidy). In other embodiments, one can determine the sex of the fetus from the hydroxymethylation pattern in the fetal fraction of the cfDNA and/or determine the fetal fraction of the cfDNA.

A method that comprises (a) obtaining a sample comprising circulating cell-free DNA, (b) enriching for the hydroxymethylated DNA in the sample and (c) independently quantifying the amount of nucleic acids in the enriched hydroxymethylated DNA that map to (i.e., have sequences that correspond to) each of one or more target loci (e.g., at least 1, at least 2, at least 3, at least 4, at least 5 or at least 10 target loci) is also provided. This method may further comprise: (d) determining whether one or more nucleic acid sequences in the enriched hydroxymethylated DNA are over-represented or under-represented in the enriched hydroxymethylated DNA, relative to a control. The identity of the nucleic acids that are over-represented or under-represented in the enriched hydroxymethylated DNA (and, in certain cases, the extent to which those nucleic acids are over-represented or under-represented in the enriched hydroxymethylated DNA) can be used to make a diagnosis, a treatment decision or a prognosis. For example, in some cases, analysis of the enriched hydroxymethylated DNA may identify a signature that correlates with a phenotype, as discussed above. In some embodiments, the amount of nucleic acid molecules in the enriched hydroxymethylated DNA that map to each of one or more target loci (e.g., the genes/intervals listed below) may be quantified by qPCR, digital PCR, arrays, sequencing or any other quantitative method.
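By way of illustration only, step (d) can be sketched as a depth-normalized log2 ratio between a test sample and a control, assuming per-locus fragment counts are already available. The function name, the pseudocount and the cutoff of 1.0 are hypothetical choices made for this sketch and are not values taken from this disclosure.

```python
import math

def classify_loci(sample_counts, control_counts, log2_cutoff=1.0, pseudocount=1.0):
    """Label each locus as over-, under-, or similarly represented.

    Counts are scaled to each library's total so that sequencing depth
    does not drive the comparison. The cutoff and pseudocount are
    hypothetical defaults chosen only for illustration.
    """
    sample_total = sum(sample_counts.values())
    control_total = sum(control_counts.values())
    calls = {}
    for locus in sample_counts:
        s = (sample_counts[locus] + pseudocount) / sample_total
        c = (control_counts.get(locus, 0) + pseudocount) / control_total
        log2_ratio = math.log2(s / c)
        if log2_ratio >= log2_cutoff:
            calls[locus] = "over"
        elif log2_ratio <= -log2_cutoff:
            calls[locus] = "under"
        else:
            calls[locus] = "similar"
    return calls
```

In practice the control could be any of the controls described herein, and an empirically derived threshold may replace the fixed cutoff used in this sketch.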

In some embodiments, the diagnosis, treatment decision or prognosis may be a cancer diagnosis. In these embodiments, the target loci may include one or more (e.g., at least 1, at least 2, at least 3, at least 4, at least 5, at least 10, at least 15 or at least 20) of the following gene bodies (i.e., transcribed regions of a gene): ABRACL, ADAMTS4, AGFG2, ALDH1A3, ALG10B, AMOTL1, APCDD1L-AS1, ARL6IP6, ASF1B, ATP6V0A2, AUNIP, BAGE, C2orf62, C8orf22, CALCB, CC2D1B, CCDC33, CCNL2, CLDN15, COMMD6, CPLX2, CRP, CTRC, DACH1, DAZL, DDX11L1, DHRS3, DUSP26, DUSP28, EPN3, EPPIN-WFDC6, ETAA1, FAM96A, FENDRR, FLJ16779, FLJ31813, GBX1, GLP2R, GMCL1P1, GNPDA2, GPR26, GSTP1, HMOX2, HOXC5, IGSF9B, INSC, INSL4, IRF7, KIF16B, KIF20B, LARS, LDHD, LHX5, LINC00158, LINC00304, LOC100128946, LOC100131234, LOC100132287, LOC100506963, LOC100507250, LOC100507410, LOC255411, LOC729737, MAFF, NPAS4, NRADDP, P2RX2, PAIP1, PAX1, PODXL2, POU4F3, PSMG1, PTPN2, RAG1, RBM14-RBM4, RDH11, RFPL3, RNF122, RNF223, RNF34, SAMD11, SHISA2, SIGLEC10, SLAMF7, SLC25A46, SLC25A47, SLC9A3R2, SORD, SOX18, SPATA31E1, SSR2, STXBP3, SYT11, SYT2, TCEA3, THAP7-AS1, TMEM168, TMEM65, TMX2, TPM4, TPO, TRAM1, TTC24, UBQLN4, WASH7P, ZNF284, ZNF423, ZNF444, ZNF800, ZNF850, and ZRANB2.

For example, in some embodiments, the amount of nucleic acids that map to each of one or more (e.g., at least 1, at least 2, at least 3, at least 4, at least 5 or at least 10) of the following gene bodies: ZNF800, TMEM65, GNPDA2, ALG10B, CLDN15, TMEM168, ETAA1, AMOTL1, STXBP3, ZNF444, LINC00158, IRF7, SLC9A3R2, TRAM1 and SLC25A46 may be independently determined, as shown in FIG. 12D.

In another example, in some embodiments, the amount of nucleic acids that map to each of one or more (e.g., at least 1, at least 2, at least 3, at least 4, at least 5 or at least 10) of the following gene bodies: CLDN15, SLC25A47, ZRANB2, LOC10050693, STXBP3, GPR26, P2RX2, LOC100507410, LHX5, HOXC5, FAM96A, CALCB, RNF223, SHISA2 and SLAMF7 may be independently determined, as shown in FIG. 12F.

In these embodiments, the target loci may include one or more (e.g., at least 1, at least 2, at least 3, at least 4, at least 5, at least 10, or at least 15) of the following intervals (where the numbering is relative to the hg19 reference genome, released as GRCh37 in February 2009): chr1:114670001-114672000, chr1:169422001-169424000, chr1:198222001-198224000, chr1:239846001-239848000, chr1:24806001-24808000, chr1:3234001-3236000, chr1:37824001-37826000, chr1:59248001-59250000, chr1:63972001-63974000, chr1:67584001-67586000, chr1:77664001-77666000, chr2:133888001-133890000, chr2:137676001-137678000, chr2:154460001-154462000, chr2:200922001-200924000, chr2:213134001-213136000, chr2:219148001-219150000, chr2:41780001-41782000, chr2:49900001-49902000, chr3:107894001-107896000, chr3:108506001-108508000, chr3:137070001-137072000, chr3:17352001-17354000, chr3:23318001-23320000, chr3:87312001-87314000, chr3:93728001-93730000, chr4:39342001-39344000, chr4:90790001-90792000, chr5:103492001-103494000, chr5:39530001-39532000, chr5:83076001-83078000, chr6:122406001-122408000, chr6:129198001-129200000, chr6:156800001-156802000, chr6:157286001-157288000, chr6:45304001-45306000, chr7:11020001-11022000, chr7:13364001-13366000, chr8:42934001-42936000, chr8:53686001-53688000, chr8:69672001-69674000, chr9:3496001-3498000 and chr9:88044001-88046000.

For example, in some embodiments, the amount of nucleic acids that map to each of one or more (e.g., at least 1, at least 2, at least 3, at least 4, at least 5 or all) of the following intervals: chr4:90790001-90792000, chr6:45304001-45306000, chr5:103492001-103494000, chr7:11020001-11022000, chr2:49900001-49902000, chr2:137676001-137678000, chr3:87312001-87314000, and chr9:88044001-88046000 may be independently determined, as shown in FIG. 12E.

In another example, in some embodiments, the amount of nucleic acids that map to each of one or more (e.g., at least 1, at least 2, at least 3, at least 4, at least 5 or all) of the following intervals: chr4:90790001-90792000, chr6:45304001-45306000, chr1:169422001-169424000, chr1:67584001-67586000, chr5:103492001-103494000, chr3:87312001-87314000, chr2:219148001-219150000, chr1:198222001-198224000, chr8:53686001-53688000, chr1:239846001-239848000, chr3:23318001-23320000, chr6:122406001-122408000, chr9:3496001-3498000, chr1:24806001-24808000, and chr8:69672001-69674000 may be independently determined, as shown in FIG. 12G.

If the diagnosis is a diagnosis of cancer, then the diagnosis may include an indication of the tissue-type of the cancer, i.e., whether the cancer is lung cancer, liver cancer, pancreatic cancer, etc.

As would be apparent, the quantification step (c) may be done using a variety of different methods. For example, as described above and below, the quantification may be done by attaching molecular identifier sequences to the enriched fragments, sequencing them, and then counting the number of molecular identifier sequences that are associated with sequence reads that map to the one or more loci (see, e.g., US20110160078). Alternatively, the quantification may be done by digital PCR (see, e.g., Kalinina et al, Nucleic Acids Research. 1997 25 (10): 1999-2004) or hybridization to an array, for example.
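The counting of molecular identifier sequences can be sketched as follows, under the simplifying assumption that each mapped read has already been reduced to a (locus, identifier) pair. The function name and input layout are hypothetical; a production pipeline would typically also consider the mapping position and tolerate sequencing errors within the identifier.

```python
from collections import defaultdict

def count_unique_molecules(reads):
    """Count distinct molecular identifiers per locus.

    `reads` is an iterable of (locus, molecular_identifier) pairs, one per
    mapped sequence read. Reads sharing a locus and identifier are treated
    as PCR copies of one original molecule and are counted once.
    """
    identifiers_by_locus = defaultdict(set)
    for locus, identifier in reads:
        identifiers_by_locus[locus].add(identifier)
    return {locus: len(ids) for locus, ids in identifiers_by_locus.items()}
```

Counting identifiers rather than reads keeps PCR duplicates from inflating the apparent amount of hydroxymethylated DNA at a locus.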

In some embodiments, the cfDNA sample can be additionally analyzed by the imaging method described in Song et al (Proc. Natl. Acad. Sci. 2016 113: 4338-43), which is incorporated by reference herein. In these embodiments, the method may comprise (a) labeling a sample comprising the cfDNA by: (i) adding a capture tag to the ends of the DNA molecules in the sample; and (ii) labeling molecules that comprise hydroxymethylcytosine with a first fluorophore; (b) immobilizing the labeled DNA molecules made in step (a) on a support; and (c) imaging individual molecules of hydroxymethylated DNA on the support. In some embodiments, this method may comprise (d) counting the number of individual molecules labeled with the first fluorophore, thereby determining the number of hydroxymethylated DNA molecules in the sample. In these embodiments, the first fluorophore of step (a)(ii) is added by incubating DNA molecules with a DNA β-glucosyltransferase and UDP glucose modified with a chemoselective group, thereby covalently labeling the hydroxymethylated DNA molecules with the chemoselective group, and linking the first fluorophore to the chemoselectively-modified DNA via a cycloaddition reaction. In some embodiments, step (a)(i) may further comprise adding a second fluorophore to the ends of the DNA molecules in the sample. In some embodiments, step (a) may further comprise: after step (ii), (iii) labeling molecules that comprise methylcytosine with a second fluorophore; and step (c) further comprises imaging individual molecules of methylated DNA on the support. In these embodiments, the method may comprise (d) counting: (i) the number of individual molecules labeled with the first fluorophore and (ii) the number of individual molecules labeled with the second fluorophore. In these embodiments, the method may further comprise (e) calculating the relative amounts of hydroxymethylated DNA and methylated DNA in the sample. In some embodiments the molecules that comprise methylcytosine are labeled with the second fluorophore by: incubating the product of step (a)(ii) with a methylcytosine dioxygenase, thereby converting methylcytosine into hydroxymethylcytosine; incubating the methylcytosine dioxygenase-treated DNA with a DNA β-glucosyltransferase and UDP glucose modified with a chemoselective group, thereby covalently labeling the hydroxymethylated DNA molecules with the chemoselective group, and linking the second fluorophore to the chemoselectively-modified DNA via a cycloaddition reaction.

In this method, step (a) may further comprise: iii. labeling molecules that comprise methylcytosine with a second fluorophore; and step (c) may comprise imaging individual molecules of genomic DNA by detecting a FRET (fluorescence resonance energy transfer) signal emanating from the first or second fluorophores of (a)(ii) or (a)(iii), wherein a FRET signal indicates that a molecule has a hydroxymethylcytosine and a methylcytosine that are proximal to one another. In these embodiments, the method may comprise determining if the molecule has a proximal hydroxymethylcytosine and methylcytosine on the same strand. Alternatively or in addition, the method may comprise determining if the molecule has a proximal hydroxymethylcytosine and methylcytosine on different strands.

The hydroxymethylcytosine/methylcytosine status of the genes/intervals listed in Tables 10A, 10B, 11A and 11B can be investigated using an array of probes. For example, in some embodiments, the method may comprise attaching labels to DNA molecules that comprise one or more hydroxymethylcytosine and methylcytosine nucleotides in a cfDNA sample, wherein the hydroxymethylcytosine nucleotides are labeled with a first optically detectable label (e.g., a first fluorophore) and the methylcytosine nucleotides are labeled with a second optically detectable label (e.g., a second fluorophore) that is distinguishable from the first label, to produce a labeled sample, and hybridizing the sample with an array of probes, where the array of probes comprises probes for at least 1, at least 2, at least 3, at least 4, at least 5, at least 10 or at least 20 of the genes or intervals listed in Tables 10A, 10B, 11A and 11B. In some cases, the array may contain top strand probes and bottom strand probes, thereby allowing the labeled top and bottom strands to be detected independently.

In some embodiments, the method may comprise attaching labels to DNA molecules that comprise one or more hydroxymethylcytosine and methylcytosine nucleotides in a sample of cfDNA, wherein the hydroxymethylcytosine nucleotides are labeled with a first capture tag and the methylcytosine nucleotides are labeled with a second capture tag that is different from the first capture tag, to produce a labeled sample; enriching for the DNA molecules that are labeled; and sequencing the enriched DNA molecules. This embodiment of the method may comprise separately enriching the DNA molecules that comprise one or more hydroxymethylcytosines and the DNA molecules that comprise one or more methylcytosine nucleotides. The labeling may be adapted from the methods described above or from Song et al (Proc. Natl. Acad. Sci. 2016 113: 4338-43), where capture tags are used instead of fluorescent labels. For example, in some embodiments the method may comprise incubating the cfDNA (e.g., adaptor-ligated cfDNA) with a DNA β-glucosyltransferase and UDP glucose modified with a chemoselective group, thereby covalently labeling the hydroxymethylated DNA molecules in the cfDNA with the chemoselective group; linking a first capture agent to the chemoselectively-modified cfDNA via the chemoselective group, e.g., via a cycloaddition reaction; incubating the product of this step with a methylcytosine dioxygenase, a DNA β-glucosyltransferase and UDP glucose modified with a chemoselective group; and linking the second capture agent to the chemoselectively-modified DNA via the chemoselective group, e.g., via a cycloaddition reaction.

In some embodiments, the determining step may be done relative to a control. Specifically, in some embodiments, the method may comprise determining whether one or more nucleic acid sequences in the enriched hydroxymethylated DNA are over-represented relative to a control and/or determining whether one or more nucleic acid sequences in the enriched hydroxymethylated DNA are under-represented relative to a control. In some embodiments, the control sequences may be in the enriched hydroxymethylated DNA. In these embodiments, the control sequences may be in the same sample as the nucleic acids that map to the target loci, but they do not map to the target loci. In other embodiments, the control sequences may be in the sample of (a), in the sample comprising circulating cell-free DNA, prior to enrichment for the hydroxymethylated DNA. In other embodiments, the control sequences may be in the sample of (a), in the sample comprising circulating cell-free DNA, after enrichment for the hydroxymethylated DNA (i.e., in the fraction of circulating cell-free DNA that does not contain the hydroxymethylated DNA). In other embodiments, the control sequences can be from a different sample. In other embodiments, the determination may be based on an empirically-derived threshold obtained from analysis of multiple samples.

Kits

Also provided by this disclosure are kits that contain reagents for practicing the subject methods, as described above. The subject kits contain one or more of any of the components described above. For example, in some embodiments, the kit may be for analyzing cfDNA. In these embodiments, the kit may comprise a DNA β-glucosyltransferase, UDP glucose modified with a chemoselective group; and an adaptor comprising a molecular barcode, as described above. In some embodiments, the adaptor may be a Y or hairpin adaptor. In some embodiments, the kit may also comprise a biotin moiety, wherein the biotin moiety is reactive with the chemoselective group.

The various components of the kit may be present in separate containers or certain compatible components may be precombined into a single container, as desired.

In addition to the above-mentioned components, the subject kits may further include instructions for using the components of the kit to practice the subject methods, i.e., instructions for sample analysis. The instructions for practicing the subject methods are generally recorded on a suitable recording medium. For example, the instructions may be printed on a substrate, such as paper or plastic, etc. As such, the instructions may be present in the kits as a package insert, in the labeling of the container of the kit or components thereof (i.e., associated with the packaging or subpackaging), etc. In other embodiments, the instructions are present as an electronic storage data file present on a suitable computer readable storage medium, e.g., CD-ROM, diskette, etc. In yet other embodiments, the actual instructions are not present in the kit, but means for obtaining the instructions from a remote source, e.g., via the internet, are provided. An example of this embodiment is a kit that includes a web address where the instructions can be viewed and/or from which the instructions can be downloaded. As with the instructions, this means for obtaining the instructions is recorded on a suitable substrate.

Compositions

Also provided by this disclosure are a variety of compositions that comprise products made by the present method. In some embodiments, the composition may comprise circulating cell-free DNA, wherein the hydroxymethylcytosine residues in the DNA are modified to contain a capture tag. In these embodiments, both strands of the circulating cell-free DNA may be in the composition. In some embodiments, the DNA may be in double-stranded form. In other embodiments, the DNA may be in single-stranded form (e.g., if the composition has been denatured by incubation at an elevated temperature).

As would be apparent from the description in the methods section of this disclosure, the capture tag may be a biotin moiety (e.g., biotin) or a chemoselective group (e.g., an azido group and an alkynyl group such as UDP-6-N3-Glu). In some embodiments, the composition may further comprise: i. β-glucosyltransferase and ii. UDP glucose modified with a chemoselective group (e.g., UDP-6-N3-Glu). These molecules are not fluorescently labeled, or labeled with an optically detectable label.

In some embodiments, the cell-free hydroxymethylated DNA is adaptor-ligated (i.e., has been ligated to adaptors). In some embodiments, the DNA may have adaptors, e.g., double-stranded, Y or hairpin adaptors, ligated to both strands at both ends.

In some embodiments, the composition may be an enriched composition in that at least 10% (e.g., at least 20%, at least 50%, at least 80% or at least 90%) of the nucleic acid molecules in the composition comprise one or more hydroxymethylcytosines that are modified to contain the capture tag. In these embodiments, the composition may further comprise, in solution, copies of the cell-free hydroxymethylated DNA that have been made by PCR. In these embodiments, the composition may comprise a population of PCR products, wherein at least 10% (e.g., at least 20%, at least 50%, at least 80% or at least 90%) of the PCR products are copied (directly or indirectly) from hydroxymethylated DNA.

In some embodiments, the composition may further comprise a support (e.g., a bead such as a magnetic bead or another solid), wherein the support and circulating cell-free DNA are linked to one another via the capture tag. The linkage may be via a covalent bond or a non-covalent bond. As would be apparent, the support may be linked to streptavidin and the capture agent may be linked to biotin.

EXAMPLES

Aspects of the present teachings can be further understood in light of the following examples, which should not be construed as limiting the scope of the present teachings in any way.

Reported herein is the first global analysis of the hydroxymethylome in cfDNA. In lung cancer a characteristic global loss of cell-free 5hmC was observed, while in HCC and pancreatic cancer significant finer scale changes of cell-free 5hmC were identified. In HCC, an exploratory study of the longitudinal samples was conducted, and it was demonstrated that cell-free 5hmC can be used to monitor treatment and recurrence. These three types of cancer displayed distinct patterns in their cell-free hydroxymethylome and we could employ machine learning algorithms trained with cell-free 5hmC features to predict the three cancer types with high accuracy. It is anticipated that cell-free 5hmC profiling will be a valuable tool for cancer diagnostics, as well as for other disease areas, including but not limited to neurodegenerative diseases, cardiovascular diseases and diabetes. Additionally, the general framework of this method can be readily adapted to sequence other modifications in cell-free nucleic acids by applying the appropriate labeling chemistry to the modified bases. This will allow a comprehensive and global overview of genetic and epigenetic changes of various disease states, and further increase the power of personalized diagnostics.

This data was obtained using a low-input whole-genome cell-free 5hmC sequencing method adapted from a selective chemical labeling method known as “hMe-Seal” (see, e.g., Song et al, Nat. Biotechnol. 2011 29, 68-72). hMe-Seal is a robust method that uses β-glucosyltransferase (βGT) to selectively label 5hmC with a biotin via an azide-modified glucose for pull-down of 5hmC-containing DNA fragments for sequencing (see FIG. 5A). The standard hMe-Seal procedure requires micrograms of DNA. In the modified approach described herein, cfDNA was first ligated with sequencing adapters and 5hmC was selectively labeled with a biotin group. After capturing cfDNA containing 5hmC using streptavidin beads, the final library was made by PCR directly from the beads instead of eluting the captured DNA. This minimizes sample loss during purification. The method is schematically illustrated in FIG. 1A.

Materials and Methods

Sample Collection and Processing

Samples for healthy subjects were obtained from the Stanford Blood Center. HCC and breast cancer patients were recruited in a Stanford University Institutional Review Board-approved protocol. Lung cancer, pancreatic cancer, GBM, gastric cancer and colorectal cancer patients were recruited in a West China Hospital Institutional Review Board-approved protocol. All recruited subjects gave informed consent. Blood was collected into EDTA-coated Vacutainers. Plasma was collected from the blood samples after centrifugation at 1,600×g for 10 min at 4° C. and 16,000×g for 10 min at 4° C. cfDNA was extracted using the Circulating Nucleic Acid Kit (Qiagen). Whole blood genomic DNA was extracted using the DNA Mini Kit (Qiagen) and fragmented using dsDNA Fragmentase (NEB) to an average of 300 bp. DNA was quantified by Qubit Fluorometer (Life Technologies). Cell-free RNA was extracted using the Plasma/Serum Circulating and Exosomal RNA Purification Kit (Norgen). The extracted cell-free RNA was further digested using Baseline-ZERO DNase (Epicentre) and depleted using Ribo-Zero rRNA Removal Kit (Epicentre) according to a protocol from Clontech.

Spike-In Amplicon Preparation

To generate the spiked-in control, lambda DNA was PCR amplified by Taq DNA Polymerase (NEB) and purified by AMPure XP beads (Beckman Coulter) in nonoverlapping ~180 bp amplicons, with a cocktail of dATP/dGTP/dTTP and one of the following: dCTP, dmCTP, or 10% dhmCTP (Zymo)/90% dCTP. Primer sequences are as follows: dCTP FW-CGTTTCCGTTCTTCTTCGTC (SEQ ID NO:1), RV-TACTCGCACCGAAAATGTCA (SEQ ID NO:2), dmCTP FW-GTGGCGGGTTATGATGAACT (SEQ ID NO:3), RV-CATAAAATGCGGGGATTCAC (SEQ ID NO:4), 10% dhmCTP/90% dCTP FW-TGAAAACGAAAGGGGATACG (SEQ ID NO:5), RV-GTCCAGCTGGGAGTCGATAC (SEQ ID NO:6).

5hmC Library Construction, Labeling, Capture and High-Throughput Sequencing

cfDNA (1-10 ng) or fragmented whole blood genomic DNA (1 μg) spiked with amplicons (0.001 pg of each amplicon per 10 ng DNA) was end repaired, 3′-adenylated and ligated to DNA Barcodes (Bioo Scientific) using KAPA Hyper Prep Kit (Kapa Biosystems) according to the manufacturer's instructions. Ligated DNA was incubated in a 25 μL solution containing 50 mM HEPES buffer (pH 8), 25 mM MgCl₂, 100 μM UDP-6-N3-Glc (Active Motif), and 12.5 U βGT (Thermo) for 2 hr at 37° C. After that, 2.5 μL DBCO-PEG4-biotin (Click Chemistry Tools, 20 mM stock in DMSO) was directly added to the reaction mixture and incubated for 2 hr at 37° C. Next, 10 μg sheared salmon sperm DNA (Life Technologies) was added into the reaction mixture and the DNA was purified by Micro Bio-Spin 30 Column (Bio-Rad). The purified DNA was incubated with 0.5 μL M270 Streptavidin beads (Life Technologies) pre-blocked with salmon sperm DNA in buffer 1 (5 mM Tris pH 7.5, 0.5 mM EDTA, 1 M NaCl and 0.2% Tween 20) for 30 min. The beads subsequently underwent three 5-min washes each with buffer 1, buffer 2 (buffer 1 without NaCl), buffer 3 (buffer 1 with pH 9) and buffer 4 (buffer 3 without NaCl). All binding and washing were done at room temperature with gentle rotation. Beads were then resuspended in water and amplified with 14 (cfDNA) or 9 (whole blood genomic DNA) cycles of PCR amplification using Phusion DNA polymerase (NEB). The PCR products were purified using AMPure XP beads. Separate input libraries were made by direct PCR from ligated DNA without labeling and capture. For technical replicates, cfDNA from the same subject was divided into two technical replicates. Paired-end 75 bp sequencing was performed on the NextSeq instrument.
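As a worked example of the labeling step, the approximate final concentration of DBCO-PEG4-biotin after it is added to the glucosylation reaction can be estimated by simple dilution. This sketch assumes that the volumes are additive and that the reaction volume is still 25 μL when the reagent is added; the function name is hypothetical.

```python
def final_concentration(stock_conc, stock_volume, starting_volume):
    """Concentration after adding a stock to an existing volume (additive volumes assumed)."""
    return stock_conc * stock_volume / (starting_volume + stock_volume)

# 2.5 uL of a 20 mM stock added to a 25 uL reaction.
dbco_mM = final_concentration(stock_conc=20.0, stock_volume=2.5, starting_volume=25.0)
print(round(dbco_mM, 2))  # 1.82 (mM)
```

Under these assumptions the biotin reagent is present at roughly 1.8 mM, well in excess of the 100 μM UDP-6-N3-Glc in the reaction.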

Data Processing and Gene Body Analysis

FASTQ sequences were aligned to UCSC/hg19 with Bowtie2 v2.2.5 and further filtered with samtools-0.1.19 (view -f 2 -F 1548 -q 30 and rmdup) to retain unique nonduplicate matches to the genome. Paired-end reads were extended and converted into bedgraph format normalized to the total number of aligned reads using bedtools, and then converted to bigwig format using bedGraphToBigWig from the UCSC Genome Browser for visualization in Integrative Genomics Viewer. FASTQ sequences were also aligned to the three spike-in control sequences to evaluate the pull-down efficiency. The spike-in control is only used as a validation of successful pull-down in each sample. hMRs were identified with MACS using unenriched input DNA as background and default settings (p-value cutoff 1e-5). Genomic annotations of hMRs were performed by determining the percentage of hMRs overlapping each genomic region by ≥1 bp. Metagene profiles were generated using ngs.plot. 5hmC FPKM values were calculated using the fragment counts in each RefSeq gene body obtained by bedtools. For differential analyses, genes shorter than 1 kb or mapped to chromosomes X and Y were excluded. Differential genic 5hmC analysis was performed using the limma package in R. GO analyses were performed using DAVID Bioinformatics Resources with GOTERM_BP_FAT. Tissue-specific gene expression was obtained from BioGPS. For the tSNE plot, the Pearson correlation of gene body 5hmC FPKM was used as the distance matrix for tSNE. MA-plot, hierarchical clustering, tSNE, LDA, and heatmaps were done in R.
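The read filter and the FPKM normalization described above can be illustrated with a short sketch. The flag names follow the SAM format specification; the helper function names are hypothetical, and the FPKM formula shown is the conventional definition (fragments per kilobase per million mapped fragments), which is assumed here rather than stated in this disclosure.

```python
# SAM FLAG bits as defined in the SAM format specification.
SAM_FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read reverse strand",
    32: "mate reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "not primary alignment",
    512: "read fails quality checks",
    1024: "read is PCR or optical duplicate",
    2048: "supplementary alignment",
}

def decode_flag_mask(mask):
    """List the SAM flag names that are set in an integer mask."""
    return [name for bit, name in SAM_FLAG_BITS.items() if mask & bit]

def passes_filter(flag, mapq, require=2, exclude=1548, min_mapq=30):
    """Mimic the selection made by `view -f 2 -F 1548 -q 30`."""
    return (flag & require) == require and (flag & exclude) == 0 and mapq >= min_mapq

def fpkm(fragments_in_gene, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of gene body per million mapped fragments."""
    return fragments_in_gene * 1e9 / (gene_length_bp * total_mapped_fragments)

print(decode_flag_mask(1548))
```

Decoding 1548 shows that the exclusion mask removes reads that are unmapped, have an unmapped mate, fail quality checks, or are marked as duplicates, while `-f 2` keeps only properly paired reads.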

Cancer Type and Stage Prediction

Cancer type-specific marker genes were selected by performing Student's t-tests between 1) one cancer group and the healthy group, 2) one cancer group and the other cancer samples, and 3) two different cancer groups. Benjamini and Hochberg correction was then applied to the raw p-values, and the genes were sorted by q-value. The top 5-20 genes with the smallest q-values were selected as the feature set to train the classifier. To achieve higher resolution, DhMRs were identified by first breaking the reference genome (hg19) into 2 kb windows in silico and calculating the 5hmC FPKM value for each window. Blacklisted genomic regions that tend to show artifact signal according to ENCODE were filtered out before downstream analysis. For cancer type-specific DhMRs, Student's t-test and Benjamini and Hochberg correction of p-values were performed for the comparison between each cancer type and healthy controls. The top 2-10 DhMRs with the smallest q-values were chosen for each cancer type. Random forest and Gaussian model-based Mclust classifiers were applied to the dataset using the previously described features (gene bodies and DhMRs). Classifiers were trained on lung cancer, pancreatic cancer, HCC and healthy samples. Parameters for random forest analysis, including random seed and mtry (number of variables randomly sampled as candidates at each split), were fine-tuned for the lowest out-of-bag estimate of error using tuneRF in the randomForest package in R. The top 15 features with the highest variable importance were plotted. Normal mixture model analysis was performed using the Mclust R package. For Mclust model-based classifier training, a Bayesian information criterion (BIC) plot was generated to visualize the classification efficacy of different multivariate mixture models. By default, the EEI model (diagonal, equal volume and shape) and the EDDA model type (single component for each class with the same covariance structure among classes) were chosen for Mclust classification. To strengthen the analysis, leave-one-out (LOO) cross-validation was performed for the random forest and Mclust classifiers with the same parameter values. For Mclust cross-validation, cvMclustDA in the Mclust R package was used.
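The feature selection step described above (Benjamini and Hochberg correction followed by ranking on q-value) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the raw p-values are assumed to have already been computed by a t-test, and the function names and example values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def top_features(names, p_values, n_top):
    """Rank features by q-value and keep the n_top smallest."""
    q_values = benjamini_hochberg(p_values)
    ranked = sorted(zip(names, q_values), key=lambda pair: pair[1])
    return ranked[:n_top]


# Hypothetical raw p-values for four candidate genes.
example_names = ["g1", "g2", "g3", "g4"]
example_p = [0.01, 0.04, 0.03, 0.005]
example_top = top_features(example_names, example_p, n_top=2)
```

With the description above, `n_top` would be chosen in the range 5-20 for gene bodies and 2-10 for DhMRs per comparison.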

Cell-Free RNA Library Construction and High-Throughput Sequencing

Cell-free RNA libraries were prepared using the ScriptSeq v2 RNA-Seq Library Preparation Kit (Epicentre) following the FFPE RNA protocol with 19 cycles of PCR amplification. The PCR products were then purified using AMPure XP beads. Paired-end 75 bp sequencing was performed on the NextSeq instrument. RNA-seq reads were first trimmed using Trimmomatic-0.33 and then aligned using tophat-2.0.14. RPKM expression values were extracted using cufflinks-2.2.1 with RefSeq gene models.

Results and Discussion

Cell-free 5hmC was readily profiled from a sample that contains less than 10 ng of cfDNA (e.g., 1-10 ng of cfDNA) using the method described above. By spiking a pool of 180 bp amplicons bearing C, 5mC, or 5hmC into cfDNA, it was demonstrated that only 5hmC-containing DNA can be detected by PCR from the beads after pull-down (FIG. 5B). This result was confirmed in the final sequencing libraries, which showed over 100-fold enrichment in reads mapping to 5hmC spike-in DNA (FIG. 1B). Furthermore, our approach performed equally well with cfDNA and bulk genomic DNA (1 μg whole blood genomic DNA (gDNA)) (FIG. 1B). The final cell-free 5hmC libraries are highly complex, with a median unique nonduplicate map rate of 0.75 when lightly sequenced (median 15 million reads, ~0.5-fold human genome coverage) (FIGS. 5C-5D, and Table 1 below), and technical replicates are highly reproducible (FIG. 1E). 5hmC-enriched regions (hMRs) were identified in the sequence data using a Poisson-based method. hMRs are highly concordant between technical replicates and a pooled sample: over 75% of hMRs in the pooled sample are in common with each of the replicates (FIG. 5F), reaching the ENCODE standard for ChIP-Seq. These results demonstrated that cell-free 5hmC can be readily and reliably profiled by the modified hMe-Seal method.

TABLE 1 Summary of 5hmC sequencing results.

sample ID       type                        total reads   unique nonduplicate   unique nonduplicate
                                            sequenced     mapped reads          mapped rate
10              healthy cfDNA               20081973      15192613              0.76
11              healthy cfDNA               19142986      14762956              0.77
27              healthy cfDNA               21862078      16645192              0.76
35-1 §          healthy cfDNA               29132339      16742468              0.57
35-2 §          healthy cfDNA               28694218      17346511              0.60
36-1 §          healthy cfDNA               32202519      20996955              0.65
36-2 §          healthy cfDNA               31089686      20993595              0.68
38o             healthy cfDNA               20124203      15295376              0.76
38              healthy cfDNA               20419287      15679281              0.77
39o             healthy cfDNA               22320662      17833176              0.80
input †         cfDNA input                 38574253      25910419              0.67
35-blood        whole blood gDNA            44077590      31654982              0.72
36-blood        whole blood gDNA            40843066      29266169              0.72
blood-input †   whole blood gDNA input      39138506      26455609              0.68
lung293         lung cancer                 14172402      11470840              0.81
lung323         lung cancer                 12269885      8916594               0.73
lung324         lung cancer                 13313728      10058078              0.76
lung395         lung cancer                 13589263      10092883              0.74
lung417         lung cancer                 13212811      10109574              0.77
lung418         lung cancer                 13103903      10420656              0.80
lung419         lung cancer                 11949356      9704240               0.81
lung492         lung cancer                 12563742      8885504               0.71
lung493         lung cancer                 12930120      10479700              0.81
lung496         lung cancer                 12267496      9657956               0.79
lung512         lung cancer                 12934833      10483836              0.81
lung513         lung cancer                 11310088      8304508               0.73
lung514         lung cancer                 12895079      10264145              0.80
lung515         lung cancer                 12132995      9406700               0.78
lung517         lung cancer                 11766082      8857054               0.75
HCC150          HCC                         15215190      11298385              0.74
HCC237          HCC                         13439935      10109197              0.75
HCC241          HCC                         16201676      12017320              0.74
HCC256          HCC                         14579945      10728759              0.74
HCC260          HCC                         13791503      10021911              0.73
HCC285          HCC                         11522024      7662330               0.67
HCC290          HCC                         13162465      9271065               0.70
HCC320          HCC                         13462633      9696240               0.72
HCC341          HCC                         11199473      6497400               0.58
HCC628          HCC                         15365745      11759122              0.77
HCC324          HCC                         12525818      9598812               0.77
HCC46           HCC                         13121530      9237102               0.70
HCC73           HCC                         13816686      10745247              0.78
HCC489          HCC                         11446887      5575387               0.49
HCC195          HCC                         11538777      7701351               0.67
HCC234          HCC                         11960087      8468478               0.71
HCC626          HCC                         13552712      11087605              0.82
HCC647          HCC                         12491614      8590321               0.69
pancreatic27    pancreatic cancer           9717087       8019436               0.83
pancreatic68    pancreatic cancer           10457109      8374219               0.80
pancreatic69    pancreatic cancer           10838005      8940883               0.82
pancreatic75    pancreatic cancer           10197772      8452749               0.83
pancreatic9     pancreatic cancer           14601356      11245279              0.77
pancreatic15    pancreatic cancer           15240467      11923009              0.78
pancreatic22    pancreatic cancer           13439343      10356395              0.77
GBM57           GBM                         8799132       6455359               0.73
GBM58           GBM                         8874810       7253089               0.82
GBM66           GBM                         9795211       8073651               0.82
GBM76           GBM                         8103209       6165341               0.76
stomach1        gastric cancer              14282633      10365849              0.73
stomach2        gastric cancer              17825012      12938872              0.73
stomach3        gastric cancer              16979690      12894400              0.76
stomach4        gastric cancer              21192604      15675499              0.74
stomach8        gastric cancer              14070772      8321549               0.59
colon13         colorectal cancer           17352371      12517451              0.72
colon16         colorectal cancer           15470656      11210513              0.72
colon17         colorectal cancer           15101557      10590748              0.70
colon19         colorectal cancer           18441208      12503926              0.68
BR5-1 §         breast cancer               17826666      13542700              0.76
BR5-2 §         breast cancer               17746176      13004851              0.73
BR7-1 §         breast cancer               16963664      13160842              0.78
BR7-2 §         breast cancer               15495003      12100951              0.78
BR13            breast cancer               21382473      16015986              0.75
BR14            breast cancer               18668112      14613260              0.78
HBV268          HBV                         8730571       5106519               0.58
HBV334          HBV                         11838111      7848078               0.66
HBV374          HBV                         14896634      11099981              0.75
HBV397          HBV                         12127855      8416798               0.69
HBV455          HBV                         12796382      9001735               0.70
HBV640          HBV                         10040349      6062886               0.60
HBV646          HBV                         9665264       5002160               0.52

§ Technical duplicate. † Unenriched input DNA.
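As a worked check of how the mapped rate column relates to the two read-count columns, the following Python sketch recomputes the rate for the first three healthy cfDNA samples of Table 1. The function name is hypothetical, and the definition (unique nonduplicate mapped reads divided by total reads sequenced) is inferred from the tabulated values rather than stated explicitly in the text.

```python
def unique_map_rate(total_reads, unique_nondup_mapped_reads):
    """Fraction of sequenced reads retained as unique nonduplicate matches."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return unique_nondup_mapped_reads / total_reads


# (sample ID, total reads sequenced, unique nonduplicate mapped reads)
# taken from the first three rows of Table 1.
table1_rows = [
    ("10", 20081973, 15192613),
    ("11", 19142986, 14762956),
    ("27", 21862078, 16645192),
]

# Rates rounded to two decimals, as reported in the table.
rates = {sample: round(unique_map_rate(total, mapped), 2)
         for sample, total, mapped in table1_rows}
```

The rounded values reproduce the rates listed for these samples (0.76, 0.77 and 0.76).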

Cell-free 5hmC was sequenced from eight healthy individuals (Tables 1 and 2). 5hmC from whole blood gDNA was also sequenced from two of the individuals, because lysed blood cells can be a major contributor to cell-free nucleic acid. Genome-scale profiles showed that the cell-free 5hmC distributions are nearly identical between healthy individuals and are clearly distinguishable from both the whole blood 5hmC distribution and the input cfDNA (FIG. 6A). Previous studies of 5hmC in mouse and human tissues showed that the majority of 5hmC resides in the gene bodies and promoter-proximal regions of the genome (Mellen et al Cell 2012 151: 1417-1430; Thomson Genome Biol. 2012 13, R93). Genome-wide analysis of hMRs in our cfDNA data showed that a majority (80%) are intragenic, with the greatest enrichment in exons (observed to expected, o/e=7.29) and depletion in intergenic regions (o/e=0.46), consistent with the pattern in whole blood (FIGS. 6B-6C) and in other tissues. The enrichment of 5hmC in gene bodies is known to be correlated with transcriptional activity in tissues such as the brain and liver (see, e.g., Mellen et al Cell 2012 151: 1417-1430; Thomson Genome Biol. 2012 13, R93). To determine whether this relationship holds in cfDNA, we performed sequencing of the cell-free RNA from the same individual. By dividing genes into three groups according to their cell-free expression and plotting the average cell-free 5hmC profile along gene bodies (metagene analysis), it was discovered that 5hmC is enriched in and around gene bodies of more highly expressed genes (FIG. 1C). These results support the view that cell-free 5hmC is a collection from various tissue types and contains information from tissues other than the blood.
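An observed-to-expected (o/e) ratio of the kind quoted above can be computed as sketched below. The exact definition used to obtain the reported values is not given in the text; this sketch assumes that "observed" is the fraction of hMRs overlapping a feature class and "expected" is the fraction of the genome covered by that class. The function name and all numbers in the example are hypothetical.

```python
def observed_over_expected(n_overlapping, n_regions, feature_bp, genome_bp):
    """Observed/expected enrichment of regions in one genomic feature class.

    Assumed definition:
      observed = fraction of regions (hMRs) overlapping the feature
      expected = fraction of the genome covered by the feature
    """
    if n_regions <= 0 or feature_bp <= 0 or genome_bp <= 0:
        raise ValueError("counts and sizes must be positive")
    observed = n_overlapping / n_regions
    expected = feature_bp / genome_bp
    return observed / expected


# Hypothetical example: 200 of 1,000 regions overlap a feature class
# that covers 5% of a 1 Gb genome, giving a four-fold enrichment.
example_oe = observed_over_expected(200, 1_000, 50_000_000, 1_000_000_000)
```

Under this assumed definition, a ratio above 1 indicates enrichment (as reported for exons) and a ratio below 1 indicates depletion (as reported for intergenic regions).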

TABLE 2 Clinical information for healthy samples.

sample ID   gender   age
10          female   53
11          female   66
27          female   66
35          male     51
36          male     73
38o         female   70
38          female   64
39o         female   49

Because cell-free 5hmC was mostly enriched in the intragenic regions, genic 5hmC fragments per kilobase of gene per million mapped reads (FPKM) was used to compare the cell-free hydroxymethylome with the whole blood hydroxymethylome. Indeed, unbiased analysis of genic 5hmC using t-distributed stochastic neighbor embedding (tSNE) showed strong separation between the cell-free and whole blood samples (FIG. 6D). The limma package (Ritchie, et al Nucleic Acids Res. 2015: 43, e47) was used to identify 2,082 differentially hydroxymethylated genes between whole blood and cell-free samples (q-values (Benjamini and Hochberg adjusted p-values)<0.01, fold change>2, FIG. 7A). Notably, the 735 blood-specific 5hmC enriched genes showed increased expression in whole blood compared to the 1,347 cell-free-specific 5hmC enriched genes (p-value<2.2×10⁻¹⁶, Welch t-test) (FIG. 7B). In agreement with the differential expression, Gene Ontology (GO) analysis of blood-specific 5hmC enriched genes mainly identified blood cell-related processes (FIG. 7C), whereas cell-free-specific 5hmC enriched genes identified much more diverse biological processes (FIG. 7D). Examples of whole blood-specific (FPR1, FPR2) and cell-free-specific (GLP1R) 5hmC enriched genes are shown in FIG. 7E. Together, these results reinforce the concept that all tissues contribute 5hmC to cfDNA and that measurement of cell-free 5hmC is a rough proxy for gene expression.
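The thresholding step applied to the differential analysis (q-value<0.01 and fold change>2, followed by a split into blood-specific and cell-free-specific genes) can be sketched as follows. This is a simplified illustration only: the q-values are assumed to come from an upstream test such as limma, the fold change is taken here as a plain ratio of group mean FPKM values, and the function name and example records are hypothetical.

```python
def split_differential(genes, q_cutoff=0.01, fold_cutoff=2.0):
    """Split significant genes by the group in which 5hmC is enriched.

    `genes` maps a gene name to (mean FPKM in whole blood,
    mean FPKM in cell-free samples, q-value).
    Returns (blood_specific, cell_free_specific) lists of gene names.
    """
    blood_specific, cell_free_specific = [], []
    for name, (mean_blood, mean_cf, q_value) in genes.items():
        if q_value >= q_cutoff:
            continue  # not significant after multiple-testing correction
        if mean_blood > fold_cutoff * mean_cf:
            blood_specific.append(name)
        elif mean_cf > fold_cutoff * mean_blood:
            cell_free_specific.append(name)
    return blood_specific, cell_free_specific


# Hypothetical records: (blood mean FPKM, cell-free mean FPKM, q-value).
example_records = {
    "GENE_A": (12.0, 3.0, 0.001),   # enriched in whole blood
    "GENE_B": (2.0, 9.0, 0.004),    # enriched in cell-free DNA
    "GENE_C": (5.0, 6.0, 0.0005),   # significant but fold change too small
    "GENE_D": (20.0, 2.0, 0.2),     # large change but not significant
}
blood_genes, cell_free_genes = split_differential(example_records)
```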

To explore the diagnostic potential of cell-free 5hmC, the method was applied to sequence cfDNA from a panel of 49 treatment-naïve primary cancer patients, including 15 lung cancer, 10 hepatocellular carcinoma (HCC), 7 pancreatic cancer, 4 glioblastoma (GBM), 5 gastric cancer, 4 colorectal cancer, and 4 breast cancer patients (Tables 3-9, below). These patients vary from early stage cancer to late stage metastatic cancer. In lung cancer, we observed a progressive global loss of 5hmC enrichment from early stage non-metastatic lung cancer to late stage metastatic lung cancer compared to healthy cfDNA, and the profile gradually resembled that of the unenriched input cfDNA (FIG. 2A). Unbiased gene body analysis using tSNE also showed a stage-dependent migration of the lung cancer profile from the healthy profile into one resembling the unenriched input cfDNA (FIG. 8A). Notably, even the early stage lung cancer samples are highly separated from the healthy samples (FIG. 8A). The global hypohydroxymethylation events were further confirmed using other metrics. First, most differential genes in metastatic lung cancer (q-values<1e-7, 1,159 genes) showed stage-dependent depletion of 5hmC compared to healthy samples (FIG. 2B). Second, the metagene profile showed a stage-dependent depletion of gene body 5hmC signal and a resemblance to the unenriched input cfDNA (FIG. 8B). Third, there is a dramatic decrease in the number of hMRs identified in lung cancer, especially in metastatic lung cancer, compared to healthy and other cancer samples (FIG. 2C). These data confirmed the stage-dependent global loss of 5hmC levels in lung cancer cfDNA.

TABLE 3 Clinical information for lung cancer samples.

sample ID   category                     TNM         stage     gender   age
lung395     non-metastatic lung cancer   T4N2Mx      III       female   62
lung419     non-metastatic lung cancer   T1N2M0G2    IIIa      female   53
lung492     non-metastatic lung cancer   T2N0M0      I         male     55
lung493     non-metastatic lung cancer   T1N3M0      IV        female   66
lung496     non-metastatic lung cancer   T3N1M0      IIIa      male     68
lung512     non-metastatic lung cancer   —           —         female   67
lung513     non-metastatic lung cancer   T2N1M0      I-II      male     47
lung514     non-metastatic lung cancer   T2N0M0      I-II      female   57
lung515     non-metastatic lung cancer   cT3N1M0     IIIA      male     52
lung293     metastatic lung cancer       cT4N3M1a    IV        female   52
lung323     metastatic lung cancer       TxN2M1      IV        female   68
lung324     metastatic lung cancer       TxNxM1      IV        male     56
lung417 §   metastatic lung cancer       —           —         male     62
lung418     metastatic lung cancer       TxN3Mx      IIIb-IV   male     59
lung517     metastatic lung cancer       cT4N2M1b    IV        male     68

All are non-small cell lung cancer samples unless otherwise noted. § Small cell lung cancer.

TABLE 4 Clinical information for HCC samples.

sample ID   category         TNM           tumor size (cm)   gender   age
HBV268      HBV              —             —                 male     36
HBV334      HBV              —             —                 female   55
HBV374      HBV              —             —                 female   45
HBV397      HBV              —             —                 female   51
HBV455      HBV              —             —                 female   66
HBV640      HBV              —             —                 female   49
HBV646      HBV              —             —                 male     60
HCC150      HCC pre-op       pT1 pNX pMX   3.1 §             male     76
HCC256      HCC pre-op       pT1 pNX pMX   15 × 9            male     80
HCC260      HCC pre-op       pT1 pNX pMX   1.3 §             male     68
HCC290      HCC pre-op       —             10 × 13 × 18      male     68
HCC320      HCC pre-op       —             multifocal        female   70
HCC628      HCC pre-op       pT1           1.8 §             male     43
HCC285      HCC pre-op       pT3N0M0       8 §               male     73
HCC324      HCC post-op      —             —                          73
HCC237      HCC pre-op       pT2 pNX pMX   4.1 §             male     52
HCC241      HCC post-op      —             —                          52
HCC341      HCC recurrence   —             3 × 1.2                    53
HCC195      HCC pre-op       pT1 pNX pM0   —                 male     44
HCC234      HCC pre-op       —             1.6 §                      44
HCC626      HCC recurrence   pT1 pNX pM0   1.7 × 1.7 × 1.0            50
HCC647      HCC post-op      —             —                          53
HCC46       HCC pre-op       pT2 pNX pMX   2.8 §             male     69
HCC73       HCC post-op      —             —                          69
HCC398      HCC follow-up    —             —                          72
HCC489      HCC recurrence   —             2.2 §                      73

§ In greatest dimension.

TABLE 5 Clinical information for pancreatic cancer samples.

sample ID      TNM      stage   metastasis to             gender   age
pancreatic9    T3N0M1   IV      liver                     male     76
pancreatic15   T1N0M0   IA      —                         male     64
pancreatic22   T4N1M0   III     —                         female   71
pancreatic27   T4N1M1   IV      abdominal wall, omentum   male     55
pancreatic68   T3N0M1   IV      liver                     male     63
pancreatic69   T3N0M0   IIA     —                         male     66
pancreatic75   T3N0M0   IIA     —                         male     54

TABLE 6 Clinical information for GBM samples.

sample ID   stage   gender   age
GBM57       IV      female   52
GBM58       IV      male     71
GBM66       IV      male     81
GBM76       IV      male     59

TABLE 7 Clinical information for gastric cancer samples.

sample ID   TNM        stage   gender   age
stomach1    T2N1M0     IIa     male     67
stomach2    T4aN3bM0   IIIc    male     54
stomach3    T1aN0M0    Ia      male     68
stomach4    T4bN0M0    IIIb    male     70
stomach8    T1bN0M0    Ia      male     65

TABLE 8 Clinical information for colorectal cancer samples.

sample ID   TNM       stage   gender   age
colon13     T4N0M0    II      male     54
colon16     T3N0M0    II      female   57
colon17     T4N0M1    IV      male     52
colon19     pT4N1M1   IV      female   62

TABLE 9 Clinical information for breast cancer samples.

sample ID   tumor size (cm)   tumor grade   age
BR5         2.5               2             54
BR7         1.2               1             71
BR13        1                 2             58
BR14        1.9               1             61

It should be noted that the global loss of 5hmC enrichment seen in lung cancer cfDNA is not due to a failure of our enrichment method, as the spike-in control in all samples, including the lung cancer samples, showed high enrichment of 5hmC-containing DNA (FIG. 8C). It is also a phenomenon unique to lung cancer that is not observed in the other cancers we tested, as evidenced by the number of hMRs (FIG. 2C) and the metagene profiles (FIG. 8B). Examples of 5hmC depleted genes in lung cancer are shown in FIG. 2D and FIG. 8D. Lung cancer tissue may have a low level of 5hmC compared to normal lung tissue, and lung may have a relatively large contribution to cfDNA. It is plausible that lung cancer, especially metastatic lung cancer, causes large quantities of hypohydroxymethylated gDNA to be released into cfDNA, effectively diluting the cfDNA and leading to the depletion of 5hmC in the cell-free 5hmC landscape. Alternatively or in combination, the cfDNA hypohydroxymethylation could originate from blood gDNA hypohydroxymethylation observed in metastatic lung cancer patients, as recently reported. Taken together, these results demonstrated that cell-free 5hmC sequencing can be used for early lung cancer detection as well as for monitoring lung cancer progression and metastasis.

For HCC, cell-free 5hmC from seven patients with hepatitis B (HBV) infection was sequenced, because most HCC cases are secondary to viral hepatitis infections (Table 4). Unbiased gene-level analysis by tSNE revealed a gradual change of cell-free 5hmC from healthy to HBV and then to HCC, mirroring the disease development (FIG. 3A). HCC-specific differential genes (q-values<0.001, fold change>1.41, 1,006 genes) could separate HCC from healthy and most of the HBV samples (FIG. 3B). Both HCC-specific enriched and depleted genes can be identified compared to other cfDNA samples (FIG. 3B), and the enriched genes (379 genes) showed increased expression in liver tissue compared to the depleted genes (637 genes) (p-value<2.2×10⁻¹⁶, Welch t-test) (FIG. 9A), consistent with the permissive effect of 5hmC on gene expression. An example of an HCC-specific 5hmC enriched gene is AHSG, which encodes a secreted protein highly expressed in the liver (FIG. 3C and FIGS. 9B-9C), and an example of an HCC-specific 5hmC depleted gene is MTBP, which was reported to inhibit migration and metastasis of HCC and to be downregulated in HCC tissues (FIG. 3D and Extended Data FIG. 5D). Together, these results point to a model where virus infection and HCC development lead to gradual damage of liver tissue and increased presentation of liver DNA in the blood.

To further explore the potential of cell-free 5hmC for monitoring treatment and disease progression, four of the HCC patients were followed. These patients underwent surgical resection, and three of them had recurrent disease (Table 4). Analysis of serial plasma samples from these patients (pre-operation/pre-op; post-operation/post-op; and recurrence) with tSNE revealed that post-op samples clustered with healthy samples, whereas the recurrence samples clustered with HCC (FIG. 3E). This pattern was also reflected by changes in the 5hmC FPKM of AHSG and MTBP (FIGS. 3C-3D). As an example of using cell-free 5hmC for tracking HCC treatment and progression, we employed linear discriminant analysis (LDA) to define a linear combination of the HCC-specific differential genes (FIG. 3B) that reduces them to a single value (the HCC score) that best separated the pre-op HCC samples from the healthy and HBV samples. We then calculated the HCC score for the post-op and recurrence HCC samples, and showed that the HCC score can accurately track the treatment and recurrence states (FIG. 5E). Together, these results demonstrate that cell-free 5hmC sequencing is a powerful tool to detect HCC, as well as to monitor treatment outcome and disease recurrence.
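The idea of collapsing several gene-level 5hmC values into a single discriminant score can be illustrated with a two-feature Fisher linear discriminant written in plain Python. This is a sketch under stated assumptions, not the analysis performed in R: it uses only two hypothetical features (one enriched and one depleted in the case group), all FPKM values are invented for illustration, and every function name is hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of 2-element samples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter_2d(rows, mu):
    """Within-class scatter matrix (2 x 2) around the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fisher_lda_2d(reference, cases):
    """Return weights w = Sw^-1 (mu_cases - mu_reference) for two features."""
    mu_ref, mu_case = mean_vector(reference), mean_vector(cases)
    s_ref, s_case = scatter_2d(reference, mu_ref), scatter_2d(cases, mu_case)
    sw = [[s_ref[i][j] + s_case[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    diff = [mu_case[0] - mu_ref[0], mu_case[1] - mu_ref[1]]
    # Apply the closed-form inverse of a 2 x 2 matrix to the mean difference.
    return [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
            (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]


def score(weights, sample):
    """Project one sample onto the discriminant axis."""
    return weights[0] * sample[0] + weights[1] * sample[1]


# Invented FPKM pairs: (enriched-gene FPKM, depleted-gene FPKM).
reference_samples = [(1.0, 5.0), (1.2, 4.6), (0.8, 5.3), (1.1, 4.9)]
pre_op_samples = [(3.0, 2.0), (3.4, 2.3), (2.7, 1.8), (3.1, 2.4)]
weights = fisher_lda_2d(reference_samples, pre_op_samples)

# Held-out samples are scored with the weights learned above.
post_op_score = score(weights, (1.1, 4.7))
recurrence_score = score(weights, (2.9, 2.2))
```

In this toy setting the weights are learned only from the reference and pre-op groups, and later samples are scored without refitting, which mirrors how a score of this kind would be applied to post-op and recurrence samples.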

It was also found that pancreatic cancer produces drastic changes in the cell-free hydroxymethylome, even in some early stage pancreatic cancer patients (Table 5). Like HCC, pancreatic cancer led to both upregulated and downregulated 5hmC genes compared to healthy individuals (q-value<0.01, fold change>2, 713 genes) (FIG. 10A). Examples of pancreatic cancer-specific 5hmC enriched and depleted genes compared to other cfDNA samples are shown in FIGS. 6B-6E. Our results suggest that cell-free 5hmC sequencing can be potentially valuable for early detection of pancreatic cancer.

Although there has been great interest in using cfDNA as a “liquid biopsy” for cancer detection, it has been challenging to identify the origin of tumor cfDNA and hence the location of the tumor. Our results suggest that analysis of cell-free 5hmC could solve this problem, because tSNE analysis of all seven cancer types shows that lung cancer, HCC, and pancreatic cancer have distinct signatures and could be readily separated from each other and from healthy samples (FIG. 4A). The other four types of cancer displayed relatively minor changes compared to the healthy samples. Using other features, such as the promoter region (5 kb upstream of the transcription start site (TSS)), showed similar patterns (FIG. 11A). It is noted that no particular cancer type that was tested resembled the whole blood profile (FIG. 11B), suggesting that blood cell contamination is not a significant source of variation. All patients in the panel fall in the same age range as the healthy individuals (FIG. 11C, and Tables 2-9); therefore, age is unlikely to be a confounding factor. No batch effect was observed (FIG. 11D).

To further demonstrate the power of cfDNA 5hmC as a biomarker to predict cancer types, two widely used machine learning methods, the Normal mixture model and Random Forest, were employed. The prediction was focused on HCC, pancreatic cancer, and non-metastatic and metastatic lung cancer. Based on three rules (see "Cancer Type and Stage Prediction" above), 90 genes (Table 10A) were identified whose average gene body 5hmC levels could distinguish either cancer groups from healthy groups or one cancer group from another.

TABLE 10A 90 gene body feature set used for cancer prediction.

ASF1B GLP2R C2orf62 SPATA31E1 SLAMF7 INSC LINC00304 LOC100507410 DUSP26 IRF7
RNF34 AUNIP TTC24 ADAMTS4 TPM4 DUSP28 RNF122 SLC9A3R2 LOC255411 ATP6V0A2
SYT2 COMMD6 POU4F3 SYT11 RFPL3 KIF16B SHISA2 EPPIN-WFDC6 CPLX2 SIGLEC10
FLJ31813 RAG1 SLC25A46 FLJ16779 ZNF284 GBX1 PAIP1 PTPN2 APCDD1L-AS1 SOX18
ZNF850 C8orf22 ZNF800 TMEM168 GMCL1P1 CLDN15 RDH11 ZNF423 PODXL2 ABRACL
LOC100507250 NRADDP BAGE EPN3 THAP7-AS1 GSTP1 CTRC TRAM1 ALDH1A3 PSMG1
MAFF AMOTL1 IGSF9B CC2D1B HOXC5 LHX5 FENDRR LOC100128946 PAX1 TPO
CRP LOC100131234 KIF20B NPAS4 STXBP3 ARL6IP6 TMEM65 ETAA1 GNPDA2 ALG10B
DAZL LINC00158 TMX2 RBM14-RBM4 SORD HMOX2 LDHD ZNF444 AGFG2 DHRS3

In a second analysis using a different method, the gene bodies listed in Table 10B were identified as being predictive for cancer.

TABLE 10B Top gene body feature set used for cancer prediction.

CLDN15 SLC25A47 ZRANB2 LOC100506963 STXBP3 GPR26 P2RX2 LOC100507410 LHX5 HOXC5
FAM96A CALCB RNF223 SHISA2 SLAMF7 PAX1 DACH1 LOC100128946 ASF1B KIF16B
SSR2 LARS DHRS3 CCDC33 GMCL1P1 COMMD6 SPATA31E1 ABRACL SAMD11 UBQLN4
TCEA3 SYT2 INSL4 RAG1 CCNL2 CRP DDX11L1 LOC729737 WASH7P LOC100132287

The target loci analyzed in the method described above may include one or more (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more, e.g., 15 or more or 20 or more) of the gene bodies listed in Tables 10A and/or 10B, as shown above.

In addition to gene bodies, 5hmC in non-coding regions could potentially serve as a biomarker in predicting cancer types. Another set of features was designed by investigating each of the 2 kb windows of the entire genome and identifying differential hMRs (DhMRs) for each cancer type. Seventeen marker DhMRs were identified for the four distinctive cancer groups (Table 11A).
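The 2 kb windowing of the genome can be sketched in Python as follows. The 1-based, inclusive coordinate convention is inferred from the DhMR coordinates reported in Table 11A (for example chr9:88044001-88046000); the function names and the toy chromosome are hypothetical, and this is not the code used to generate the reported windows.

```python
def window_for_position(chrom, pos, window_bp=2_000):
    """Return the fixed-size window (1-based, inclusive) containing `pos`."""
    if pos < 1:
        raise ValueError("positions are 1-based")
    index = (pos - 1) // window_bp
    start = index * window_bp + 1
    end = start + window_bp - 1
    return f"{chrom}:{start}-{end}"


def tile_chromosome(chrom, chrom_length_bp, window_bp=2_000):
    """Yield consecutive (chrom, start, end) windows covering a chromosome.

    The last window is truncated at the chromosome end.
    """
    start = 1
    while start <= chrom_length_bp:
        end = min(start + window_bp - 1, chrom_length_bp)
        yield (chrom, start, end)
        start += window_bp


# A toy 5 kb chromosome is tiled into two full windows and one partial window.
toy_windows = list(tile_chromosome("chrT", 5_000))
# A position inside one of the reported DhMRs maps back to that window.
example_window = window_for_position("chr9", 88_045_000)
```

A 5hmC FPKM value would then be computed per window in the same way as per gene body, with the window length fixed at 2 kb.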

TABLE 11A 17 DhMR feature set used for cancer prediction.

chr9:88044001-88046000     chr1:63972001-63974000     chr1:114670001-114672000
chr2:133888001-133890000   chr1:37824001-37826000     chr8:53686001-53688000
chr2:49900001-49902000     chr5:103492001-103494000   chr2:137676001-137678000
chr2:200922001-200924000   chr2:41780001-41782000     chr3:137070001-137072000
chr7:11020001-11022000     chr4:90790001-90792000     chr3:93728001-93730000
chr3:87312001-87314000     chr6:45304001-45306000

In a second analysis using a different method, the DhMRs listed in Table 11B were identified as being predictive for cancer.

TABLE 11B Top DhMR feature set used for cancer prediction.

chr4:90790001-90792000     chr6:45304001-45306000     chr1:169422001-169424000
chr1:67584001-67586000     chr5:103492001-103494000   chr3:87312001-87314000
chr2:219148001-219150000   chr1:198222001-198224000   chr8:53686001-53688000
chr1:239846001-239848000   chr3:23318001-23320000     chr6:122406001-122408000
chr9:3496001-3498000       chr1:24806001-24808000     chr8:69672001-69674000
chr2:49900001-49902000     chr3:107894001-107896000   chr8:42934001-42936000
chr3:17352001-17354000     chr6:157286001-157288000   chr3:108506001-108508000
chr4:39342001-39344000     chr6:129198001-129200000   chr3:137070001-137072000
chr1:59248001-59250000     chr5:83076001-83078000     chr3:93728001-93730000
chr2:213134001-213136000   chr5:39530001-39532000     chr1:3234001-3236000
chr1:37824001-37826000     chr6:156800001-156802000   chr7:13364001-13366000
chr1:77664001-77666000     chr2:154460001-154462000   chr2:41780001-41782000

The target loci analyzed in the method described above may include one or more (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more, e.g., 15 or more or 20 or more) of the DhMRs listed in Tables 11A and/or 11B, as shown above.

The two machine learning algorithms were trained using either 90 genes or 17 DhMRs as features, and the prediction accuracy was evaluated with leave-one-out (LOO) cross-validation. The Normal mixture model based predictor (Mclust) had LOO cross-validation error rates of 10% and 5% when using gene bodies and DhMRs as features, respectively (FIG. 4B and FIGS. 12A-12B). Mclust-based dimensional reduction showed clear boundaries between the groups (FIG. 12C). The Random Forest predictor achieved LOO cross-validation error rates of 5% and 0% when using gene bodies and DhMRs as features, respectively (FIG. 4B). Distinct 5hmC profiles in different cancer types could be observed for several DhMRs with high variable importance in the random forest prediction model (FIGS. 12D-12E). Finally, Cohen's kappa was used to evaluate the concordance rate between different prediction models. All combinations showed high agreement (Cohen's kappa ~0.9) in inter-classifier comparison and when comparing with the actual classification (FIG. 4C). FIGS. 12F and 12G show the variable importance for gene bodies and DhMRs, obtained using a different method. These results demonstrate that cell-free 5hmC can be used for cancer diagnostics and staging.
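Cohen's kappa, used above to compare the predictions of two classifiers (or one classifier against the actual classification), can be computed as in the following Python sketch. The function name and the example label lists are hypothetical and are not taken from the study data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of class labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of samples given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from the marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical predictions from two classifiers on six samples.
predictions_a = ["HCC", "HCC", "lung", "lung", "healthy", "healthy"]
predictions_b = ["HCC", "HCC", "lung", "healthy", "healthy", "healthy"]
example_kappa = cohens_kappa(predictions_a, predictions_b)
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 1 indicates that two prediction models assign nearly the same labels beyond what class frequencies alone would produce.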

It will also be recognized by those skilled in the art that, while the invention has been described above in terms of preferred embodiments, it is not limited thereto. Various features and aspects of the above described invention may be used individually or jointly. Further, although the invention has been described in the context of its implementation in a particular environment, and for particular applications (e.g. cfDNA analysis), those skilled in the art will recognize that its usefulness is not limited thereto and that the present invention can be beneficially utilized in any number of environments and implementations where it is desirable to examine hydroxymethylation. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the invention as disclosed herein.

What is claimed is:
 1. A kit for analyzing cfDNA, comprising: DNA β-glucosyltransferase; UDP glucose modified with a chemoselective group; an adaptor comprising at least one molecular barcode; and a spiked-in control comprising three amplicons synthesized from a cocktail of dATP, dGTP, dTTP, and (1) dCTP, (2) dmCTP, or (3) dhmCTP and dCTP.
 2. The kit of claim 1, wherein the at least one molecular barcode comprises a sample identifier sequence and a molecule identifier sequence.